Tag Archives: Git

Improve Git monorepo performance with a file system monitor

Post Syndicated from Jeff Hostetler original https://github.blog/2022-06-29-improve-git-monorepo-performance-with-a-file-system-monitor/

If you have a monorepo, you’ve probably already felt the pain of slow Git commands, such as git status and git add. These commands are slow because they need to search the entire worktree looking for changes. When the worktree is very large, Git needs to do a lot of work.

The Git file system monitor (FSMonitor) feature can speed up these commands by reducing the size of the search, and this can greatly reduce the pain of working in large worktrees. For example, this chart shows status times dropping to under a second on three different large worktrees when FSMonitor is enabled!

In this article, I want to talk about the new builtin FSMonitor git fsmonitor--daemon added in Git version 2.37.0. This is easy to set up and use since it is “in the box” and does not require any third-party tooling nor additional software. It only requires a config change to enable it. It is currently available on macOS and Windows.

To enable the new builtin FSMonitor, just set core.fsmonitor to true. A daemon will be started automatically in the background by the next Git command.

FSMonitor works well with core.untrackedcache, so we’ll also turn it on for the FSMonitor test runs. We’ll talk more about the untracked-cache later.

$ time git status
On branch main
Your branch is up to date with 'origin/main'.

It took 5.25 seconds to enumerate untracked files. 'status -uno'
may speed it up, but you have to be careful not to forget to add
new files yourself (see 'git help status').
nothing to commit, working tree clean

real    0m17.941s
user    0m0.031s
sys     0m0.046s

$ git config core.fsmonitor true
$ git config core.untrackedcache true

$ time git status
On branch main
Your branch is up to date with 'origin/main'.

It took 6.37 seconds to enumerate untracked files. 'status -uno'
may speed it up, but you have to be careful not to forget to add
new files yourself (see 'git help status').
nothing to commit, working tree clean

real    0m19.767s
user    0m0.000s
sys     0m0.078s

$ time git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

real    0m1.063s
user    0m0.000s
sys     0m0.093s

$ git fsmonitor--daemon status
fsmonitor-daemon is watching 'C:/work/chromium'

_Note that when the daemon first starts up, it needs to synchronize with the state of the index, so the next git status command may be just as slow (or slightly slower) than before, but subsequent commands should be much faster.

In this article, I’ll introduce the new builtin FSMonitor feature and explain how it improves performance on very large worktrees.

How FSMonitor improves performance

Git has a “What changed while I wasn’t looking?” problem. That is, when you run a command that operates on the worktree, such as git status, it has to discover what has changed relative to the index. It does this by searching the entire worktree. Whether you immediately run it again or run it again tomorrow, it has to rediscover all of that same information by searching again. Whether you edit zero, one, or a million files in the mean time, the next git status command has to do the same amount of work to rediscover what (if anything) has changed.

The cost of this search is relatively fixed and is based upon the number of files (and directories) present in the worktree. In a monorepo, there might be millions of files in the worktree, so this search can be very expensive.

What we really need is a way to focus on the changed files without searching the entire worktree.

How FSMonitor works

FSMonitor is a long-running daemon or service process.

  • It registers with the operating system to receive change notification events on files and directories.
  • It adds the pathnames of those files and directories to an in-memory, time-sorted queue.
  • It listens for IPC connections from client processes, such as git status.
  • It responds to client requests for a list of files and directories that have been modified recently.

FSMonitor must continuously watch the worktree to have a complete view of all file system changes, especially ones that happen between Git commands. So it must be a long-running daemon or service process and not associated with an individual Git command instance. And thus, it cannot be a traditional Git hook (child) process. This design does allow it to service multiple (possibly concurrent) Git commands.

FSMonitor Synchronization

FSMonitor has the concept of a “token”:

  • A token is an opaque string defined by FSMonitor and can be thought of as a globally unique sequence number or timestamp.
  • FSMonitor creates a new token whenever file system events happen.
  • FSMonitor groups file system changes into sets by these ordered tokens.
  • A Git client command sends a (previously generated) token to FSMonitor to request the list of pathnames that have changed, since FSMonitor created that token.
  • FSMonitor includes the current token in every response. The response contains the list of pathnames that changed between the sent and received tokens.

git status writes the received token into the index with other FSMonitor data before it exits. The next git status command reads the previous token (along with the other FSMonitor data) and asks FSMonitor what changed since the previous token.

Earlier, I said a token is like a timestamp, but it also includes other fields to prevent incomplete responses:

  • The FSMonitor process id (PID): This identifies the daemon instance that created the token. If the PID in a client’s request token does not match the currently running daemon, we must assume that the client is asking for data on file system events generated before the current daemon instance was started.
  • A file system synchronization id (SID): This identifies the most recent synchronization with the file system. The operating system may drop file system notification events during heavy load. The daemon itself may get overloaded, fall behind, and drop events. Either way, events were dropped, and there is a gap in our event data. When this happens, the daemon must “declare bankruptcy” and (conceptually) restart with a new SID. If the SID in a client’s request token does not match the daemon’s curent SID, we must assume that the client is asking for data spanning such a resync.

In both cases, a normal response from the daemon would be incomplete because of gaps in the data. Instead, the daemon responds with a trivial (“assume everything was changed”) response and a new token. This will cause the current Git client command to do a regular scan of the worktree (as if FSMonitor were not enabled), but let future client commands be fast again.

Types of files in your worktree

When git status examines the worktree, it looks for tracked, untracked, and ignored files.

Tracked files are files under version control. These are files that Git knows about. These are files that Git will create in your worktree when you do a git checkout. The file in the worktree may or may not match the version listed in the index. When different, we say that there is an unstaged change. (This is independent of whether the staged version matches the version referenced in the HEAD commit.)

Untracked files are just that: untracked. They are not under version control. Git does not know about them. They may be temporary files or new source files that you have not yet told Git to care about (using git add).

Ignored files are a special class of untracked files. These are usually temporary files or compiler-generated files. While Git will ignore them in commands like git add, Git will see them while searching the worktree and possibly slow it down.

Normally, git status does not print ignored files, but we’ll turn it on for this example so that we can see all four types of files.

$ git status --ignored
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   README

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   README
    modified:   main.c

Untracked files:
  (use "git add <file>..." to include in what will be committed)

Ignored files:
  (use "git add -f <file>..." to include in what will be committed)

The expensive worktree searches

During the worktree search, Git treats tracked and untracked files in two distinct phases. I’ll talk about each phase in detail in later sections.

  1. In “refresh_index,” Git looks for unstaged changes. That is, changes to tracked files that have not been staged (added) to the index. This potentially requires looking at each tracked file in the worktree and comparing its contents with the index version.
  2. In “untracked,” Git searches the worktree for untracked files and filters out tracked and ignored files. This potentially requires completely searching each subdirectory in the worktree.

There is a third phase where Git compares the index and the HEAD commit to look for staged changes, but this phase is very fast, because it is inspecting internal data structures that are designed for this comparision. It avoids the significant number of system calls that are required to inspect the worktree, so we won’t worry about it here.

A detailed example

The chart in the introduction showed status times before and after FSMonitor was enabled. Let’s revisit that chart and fill in some details.

I collected performance data for git status on worktrees from three large repositories. There were no modified files, and git status was clean.

  1. The Chromium repository contains about 400K files and 33K directories.
  2. A synthetic repository containing 1M files and 111K directories.
  3. A synthetic repository containing 2M files and 111K directories.

Here we can see that when FSMonitor is not present, the commands took from 17 to 85 seconds. However, when FSMonitor was enabled the commands took less than 1 second.

Each bar shows the total run time of the git status commands. Within each bar, the total time is divided into parts based on performance data gathered by Git’s trace2 library to highlight the important or expensive steps within the commands.

Worktree Files refresh_index

with Preload


without Untracked-Cache

Remainder Total
Chromium 393K 12.3s 5.1s 0.16s 17.6s
Synthetic 1M 1M 30.2s 10.5s 0.36s 41.1s
Synthetic 2M 2M 73.2s 11.2s 0.64s 85.1s

The top three bars are without FSMonitor. We can see that most of the time was spent in the refresh_index and untracked columns. I’ll explain what these are in a minute. In the remainder column, I’ve subtracted those two from the total run time. This portion barely shows up on these bars, so the key to speeding up git status is to attack those two phases.

The bottom three bars on the above chart have FSMonitor and the untracked-cache enabled. They show a dramatic performance improvement. On this chart these bars are barely visible, so let’s zoom in on them.

This chart rescales the FSMonitor bars by 100X. The refresh_index and untracked columns are still present but greatly reduced thanks to FSMonitor.

Worktree Files refresh_index

with FSMonitor


with FSMonitor

and Untracked-Cache

Remainder Total
Chromium 393K 0.024s 0.519s 0.284s 0.827s
Synthetic 1M 1M 0.050s 0.112s 0.428s 0.590s
Synthetic 2M 2M 0.096s 0.082s 0.572s 0.750s

This is bigger than just status

So far I’ve only talked about git status, since it is the command that we probably use the most and are always thinking about when talking about performance relative to the state and size of the worktree. But it is just one of many affected commands:

  • git diff does the same search, but uses the changed files it finds to print a difference in the worktree and your index.
  • git add . does the same search, but it stages each changed file it finds.
  • git restore and git checkout do the same search to decide the files to be replaced.

So, for simplicity, I’ll just talk about git status, but keep in mind that this approach benefits many other commands, since the cost of actually staging, overwriting, or reporting the change is relatively trivial by comparison — the real performance cost in these commands (as the above charts show) is the time it takes to simply find the changed files in the worktree.

Phase 1: refresh_index

The index contains an “index entry” with information for each tracked file. The git ls-files command can show us what that list looks like. I’ll truncate the output to only show a couple of files. In a monorepo, this list might contain millions of entries.

$ git ls-files --stage --debug
100644 7ce4f05bae8120d9fa258e854a8669f6ea9cb7b1 0   README.md
  ctime: 1646085519:36302551
  mtime: 1646085519:36302551
  dev: 16777220 ino: 180738404
  uid: 502  gid: 20
  size: 3639    flags: 0
100644 5f1623baadde79a0771e7601dcea3c8f2b989ed9 0   Makefile
  ctime: 1648154224:994917866
  mtime: 1648154224:994917866
  dev: 16777221 ino: 182328550
  uid: 502  gid: 20
  size: 110149  flags: 0

Scanning tracked files for unstaged changes

Let’s assume at the beginning of refresh_index that all index entries are “unmarked” — meaning that we don’t know yet whether or not the worktree file contains an unstaged change. And we “mark” an index entry when we know the answer (either way).

To determine if an individual tracked file has an unstaged change, it must be “scanned”. That is, Git must read, clean, hash the current contents of the file, and compare the computed hash value with the hash value stored in the index. If the hashes are the same, we mark the index entry as “valid”. If they are different, we mark it as an unstaged change.

In theory, refresh_index must repeat this for each tracked file in the index.

As you can see, each individual file that we have to scan will take time and if we have to do a “full scan”, it will be very slow, especially if we have to do it for millions of files. For example, on the Chromium worktree, when I forced a full scan it took almost an hour.

Worktree Files Full Scan
Chromium 393K 3072s

refresh_index shortcuts

Since doing a full scan of the worktree is so expensive, Git has developed various shortcuts to avoid scanning whenever possible to increase the performance of refresh_index.

For discussion purposes, I’m going to describe them here as independent steps rather than somewhat intertwined steps. And I’m going to start from the bottom, because the goal of each shortcut is to look at unmarked index entries, mark them if they can, and make less work for the next (more expensive) step. So in a perfect world, the final “full scan” would have nothing to do, because all of the index entries have already been marked, and there are no unmarked entries remaining.

In the above chart, we can see the cummulative effects of these shortcuts.

Shortcut: refresh_index with lstat()

The “lstat() shortcut” was created very early in the Git project.

To avoid actually scanning every tracked file on every git status command, Git relies on a file’s last modification time (mtime) to tell when a file was last changed. File mtimes are updated when files are created or edited. We can read the mtime using the lstat() system call.

When Git does a git checkout or git add, it writes each worktree file’s current mtime into its index entry. These serve as the reference mtimes for future git status commands.

Then, during a later git status, Git checks the current mtime against the reference mtime (for each unmarked file). If they are identical, Git knows that the file content hasn’t changed and marks the index entry valid (so that the next step will avoid it). If the mtimes are different, this step leaves the index entry unmarked for the next step.

Worktree Files refresh_index with lstat()
Chromium 393K 26.9s
Synthetic 1M 1M 66.9s
Synthetic 2M 2M 136.6s

The above table shows the time in seconds taken to call lstat() on every file in the worktree. For the Chromium worktree, we’ve cut the time of refresh_index from 50 minutes to 27 seconds.

Using mtimes is much faster than always scanning each file, but Git still has to lstat() every tracked file during the search, and that can still be very slow when there are millions of files.

In this experiment, there were no modifications in the worktree, and the index was up to date, so this step marked all of the index entries as valid and the “scan all unmarked” step had nothing to do. So the time reported here is essentially just the time to call lstat() in a loop.

This is better than before, but even though we are only doing an lstat(), git status is still spending more than 26 seconds in this step. We can do better.

Shortcut: refresh_index with preload

The core.preloadindex config option is an optional feature in Git. The option was introduced in version 1.6 and was enabled by default in 2.1.0 on platforms that support threading.

This step partitions the index into equal-sized chunks and distributes it to multiple threads. Each thread does the lstat() shortcut on their partition. And like before, index entries with different mtimes are left unmarked for the next step in the process.

The preload step does not change the amount of file scanning that we need to do in the final step, it just distributes the lstat() calls across all of your cores.

Worktree Files refresh_index with Preload
Chromium 393K 12.3s
Synthetic 1M 1M 30.2s
Synthetic 2M 2M 73.2s

With the preload shortcut git status is about twice as fast on my 4-core Windows laptop, but it is still expensive.

Shortcut: refresh_index with FSMonitor

When FSMonitor is enabled:

  1. The git fsmonitor--daemon is started in the background and listens for file system change notification events from the operating system for files within the worktree. This includes file creations, deletions, and modifications. If the daemon gets an event for a file, that file probably has an updated mtime. Said another way, if a file mtime changes, the daemon will get an event for it.
  2. The FSMonitor index extension is added to the index to keep track of FSMonitor and git status data between git status commands. The extension contains an FSMonitor token and a bitmap listing the files that were marked valid by the previous git status command (and relative to that token).
  3. The next git status command will use this bitmap to initialize the marked state of the index entries. That is, the previous Git command saved the marked state of the index entries in the bitmap and this command restores them — rather than initializing them all as unmarked.
  4. It will then ask the daemon for a list of files that have had file system events since the token and unmark each of them. FSMonitor tells us the exact set of files that have been modified in some way since the last command, so those are the only files that we should need to visit.

At this point, all of the unchanged files should be marked valid. Only files that may have changed should be unmarked. This sets up the next shortcut step to have very little to do.

Worktree Files Query FSMonitor refresh_index with FSMonitor
Chromium 393K 0.017s 0.024s
Synthetic 1M 1M 0.002s 0.050s
Synthetic 2M 2M 0.002s 0.096s

This table shows that refresh_index is now very fast since we don’t need to any searching. And the time to request the list of files over IPC is well worth the complex setup.

Phase 2: untracked

The “untracked” phase is a search for anything in the worktree that Git does not know about. These are files and directories that are not under version control. This requires a full search of the worktree.

Conceptually, this looks like:

  1. A full recursive enumeration of every directory in the worktree.
  2. Build a complete list of the pathnames of every file and directory within the worktree.
  3. Take each found pathname and do a binary search in the index for a corresponding index entry. If one is found, the pathname can be omitted from the list, because it refers to a tracked file.
    1. On case insensitive systems, such as Windows and macOS, a case insensitive hash table must be constructed from the case sensitive index entries and used to lookup the pathnames instead of the binary search.
  4. Take each remaining pathname and apply .gitignore pattern matching rules. If a match is found, then the pathname is an ignored file and is omitted from the list. This pattern matching can be very expensive if there are lots of rules.
  5. The final resulting list is the set of untracked files.

This search can be very expensive on monorepos and frequently leads to the following advice message:

$ git status
On branch main
Your branch is up to date with 'origin/main'.

It took 5.12 seconds to enumerate untracked files. 'status -uno'
may speed it up, but you have to be careful not to forget to add
new files yourself (see 'git help status').
nothing to commit, working tree clean

Normally, the complete discovery of the set of untracked files must be repeated for each command unless the [core.untrackedcache](https://git-scm.com/docs/git-config#Documentation/git-config.txt-coreuntrackedCache) feature is enabled.

The untracked-cache

The untracked-cache feature adds an extension to the index that remembers the results of the untracked search. This includes a record for each subdirectory, its mtime, and a list of the untracked files within it.

With the untracked-cache enabled, Git still needs to lstat() every directory in the worktree to confirm that the cached record is still valid.

If the mtimes match:

  • Git avoids calling opendir() and readdir() to enumerate the files within the directory,
  • and just uses the existing list of untracked files from the cache record.

If the mtimes don’t match:

  • Git needs to invalidate the untracked-cache entry.
  • Actually open and read the directory contents.
  • Call lstat() on each file or subdirectory within the directory to determine if it is a file or directory and possibly invalidate untracked-cache entries for any subdirectories.
  • Use the file pathname to do tracked file filtering.
  • Use the file pathname to do ignored file filtering
  • Update the list of untracked files in the untracked-cache entry.

How FSMonitor helps the untracked-cache

When FSMonitor is also enabled, we can avoid the lstat() calls, because FSMonitor tells us the set of directories that may have an updated mtime, so we don’t need to search for them.

Worktree Files Untracked

without Untracked-Cache


with Untracked-Cache


with Untracked-Cache

and FSMonitor

Chromium 393K 5.1s 2.3s 0.83s
Synthetic 1M 1M 10.5s 6.3s 0.59s
Synthetic 2M 2M 11.2s 6.6s 0.75s

By itself, the untracked-cache feature gives roughly a 2X speed up in the search for untracked files. Use both the untracked-cache and FSMonitor, and we see a 10X speedup.

A note about ignored files

You can improve Git performance by not storing temporary files, such as compiler intermediate files, inside your worktree.

During the untracked search, Git first eliminates the tracked files from the candidate untracked list using the index. Git then uses the .gitignore pattern matching rules to eliminate the ignored files. Git’s performance will suffer if there are many rules and/or many temporary files.

For example, if there is a *.o for every source file and they are stored next to their source files, then every build will delete and recreate one or more object files and cause the mtime on their parent directories to change. Those mtime changes will cause git status to invalidate the corresponding untracked-cache entries and have to re-read and re-filter those directories — even if no source files actually changed. A large number of such temporary and uninteresting files can greatly affect the performance of these Git commands.

Keeping build artifacts out of your worktree is part of the philosophy of the Scalar Project. Scalar introduced Git tooling to help you keep your worktree in <repo-name>/src/ to make it easier for you to put these other files in <repo-name>/bin/ or <repo-name>/packages/, for example.

A note about sparse checkout

So far, we’ve talked about optimizations to make Git work smarter and faster on worktree-related operations by caching data in the index and in various index extensions. Future commands are faster, because they don’t have to rediscover everything and therefore can avoid repeating unnecessary or redundant work. But we can only push that so far.

The Git sparse checkout feature approaches worktree performance from another angle. With it, you can ask Git to only populate the files that you need. The parts that you don’t need are simply not present. For example, if you only need 10% of the worktree to do your work, why populate the other 90% and force Git to search through them on every command?

Sparse checkout speeds the search for unstaged changes in refresh_index because:

  1. Since the unneeded files are not actually present on disk, they cannot have unstaged changes. So refresh_index can completely ignore them.
  2. The index entries for unneeded files are pre-marked during git checkout with the skip-worktree bit, so they are never in an “unmarked” state. So those index entries are excluded from all of the refresh_index loops.

Sparse checkout speeds the search for untracked files because:

  1. Since Git doesn’t know whether a directory contains untracked files until it searches it, the search for untracked files must visit every directory present in the worktree. Sparse checkout lets us avoid creating entire sub-trees or “cones” from the worktree. So there are fewer directories to visit.
  2. The untracked-cache does not need to create, save, and restore untracked-cache entries for the unpopulated directories. So reading and writing the untracked-cache extension in the index is faster.

External file system monitors

So far we have only talked about Git’s builtin FSMonitor feature. Clients use the simple IPC interface to communicate directly with git fsmonitor--daemon over a Unix domain socket or named pipe.

However, Git added support for an external file system monitor in version 2.16.0 using the core.fsmonitor hook. Here, clients communicate with a proxy child helper process through the hook interface, and it communicates with an external file system monitor process.

Conceptually, both types of file system monitors are identical. They include a long-running process that listens to the file system for changes and are able to respond to client requests for a list of recently changed files and directories. The response from both are used identically to update and modify the refresh_index and untracked searches. The only difference is in how the client talks to the service or daemon.

The original hook interface was useful, because it allowed Git to work with existing off-the-shelf tools and allowed the basic concepts within Git to be proven relatively quickly, confirm correct operation, and get a quick speed up.

Hook protocol versions

The original 2.16.0 version of the hook API used protocol version 1. It was a timestamp-based query. The client would send a timestamp value, expressed as nanoseconds since January 1, 1970, and expect a list of the files that had changed since that timestamp.

Protocol version 1 has several race conditions and should not be used anymore. Protocol version 2 was added in 2.26.0 to address these problems.

Protocol version 2 is based upon opaque tokens provided by the external file system monitor process. Clients make token-based queries that are relative to a previously issued token. Instead of making absolute requests, clients ask what has changed since their last request. The format and content of the token is defined by the external file system monitor, such as Watchman, and is treated as an opaque string by Git client commands.

The hook protocol is not used by the builtin FSMonitor.

Using Watchman and the sample hook script

Watchman is a popular external file system monitor tool and a Watchman-compatible hook script is included with Git and copied into new worktrees during git init.

To enable it:

  1. Install Watchman on your system.
  2. Tell Watchman to watch your worktree:
$ watchman watch .
    "version": "2022.01.31.00",
    "watch": "/Users/jeffhost/work/chromium",
    "watcher": "fsevents"

  1. Install the sample hook script to teach Git how to talk to Watchman:
$ cp .git/hooks/fsmonitor-watchman.sample .git/hooks/query-watchman

  1. Tell Git to use the hook:
$ git config core.fsmonitor .git/hooks/query-watchman

Using Watchman with a custom hook

The hook interface is not limited to running shell or Perl scripts. The included sample hook script is just an example implementation. Engineers at Dropbox described how they were able to speed up Git with a custom hook executable.

Final Remarks

In this article, we have seen how a file system monitor can speed up commands like git status by solving the “discovery” problem and eliminating the need to search the worktree for changes in every command. This greatly reduces the pain of working with monorepos.

This feature was created in two efforts:

  1. First, Git was taught to work with existing off-the-shelf tools, like Watchman. This allowed the basic concepts to be proven relatively quickly. And for users who already use Watchman for other purposes, it allows Git to efficiently interoperate with them.
  2. Second, we brought the feature “in the box” to reduce the setup complexity and third-party dependencies, which some users may find useful. It also lets us consider adding Git-specific features that a generic monitoring tool might not want, such as understanding ignored files and omitting them from the service’s response.

Having both options available lets users choose the best solution for their needs.

Regardless of which type of file system monitor you use, it will help make your monorepos more usable.

Highlights from Git 2.37

Post Syndicated from Taylor Blau original https://github.blog/2022-06-27-highlights-from-git-2-37/

The open source Git project just released Git 2.37, with features and bug fixes from over 75 contributors, 20 of them new. We last caught up with you on the latest in Git back when 2.36 was released.

To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

Before we get into the details of Git 2.37.0, we first wanted to let you know that Git Merge is returning this September. The conference features talks, workshops, and more all about Git and the Git ecosystem. There is still time to submit a proposal to speak. We look forward to seeing you there!

A new mechanism for pruning unreachable objects

In Git, we often talk about classifying objects as either “reachable” or “unreachable”. An object is “reachable” when there is at least one reference (a branch or a tag) from which you can start an object walk (traversing from commits to their parents, from trees into their sub-trees, and so on) and end up at your destination. Similarly, an object is “unreachable” when no such reference exists.

A Git repository needs all of its reachable objects to ensure that the repository is intact. But it is free to discard unreachable objects at any time. And it is often desirable to do just that, particularly when many unreachable objects have piled up, you’re running low on disk space, or similar. In fact, Git does this automatically when running garbage collection.

But observant readers will notice the gc.pruneExpire configuration. This setting defines a “grace period” during which unreachable objects which are not yet old enough to be removed from the repository completely are left alone. This is done in order to mitigate a race condition where an unreachable object that is about to be deleted becomes reachable by some other process (like an incoming reference update or a push) before then being deleted, leaving the repository in a corrupt state.

Setting a small, non-zero grace period makes it much less likely to encounter this race in practice. But it leads us to another problem: how do we keep track of the age of the unreachable objects which didn’t leave the repository? We can’t pack them together into a single packfile; since all objects in a pack share the same modification time, updating any object drags them all forward. Instead, prior to Git 2.37, each surviving unreachable object was written out as a loose object, and the mtime of the individual objects was used to store their age. This can lead to serious problems when there are many unreachable objects which are too new and can’t be pruned.

Git 2.37 introduces a new concept, cruft packs, which allow unreachable objects to be stored together in a single packfile by writing the ages of individual objects in an auxiliary table stored in an *.mtimes file alongside the pack.

While cruft packs don’t eliminate the data race we described earlier, in practice they can help make it much less likely by allowing repositories to prune with a much longer grace period, without worrying about the potential to create many loose objects. To try it out yourself, you can run:

$ git gc --cruft --prune=1.day.ago

and notice that your $GIT_DIR/objects/pack directory will have an additional .mtimes file, storing the ages of each unreachable object written within the last 24 hours

$ ls -1 .git/objects/pack

There’s a lot of detail we haven’t yet covered on cruft packs, so expect a more comprehensive technical overview in a separate blog post soon.


A builtin filesystem monitor for Windows and macOS

As we have discussed often before, one of the factors that significantly impact Git’s performance is the size of your working directory. When you run git status, for example, Git has to crawl your entire working directory (in the worst case) in order to figure out which files have been modified.

Git has its own cached understanding of the filesystem to avoid this whole-directory traversal in many cases. But it can be expensive for Git to update its cached understanding of the filesystem with the actual state of the disk while you work.

In the past, Git has made it possible to integrate with tools like Watchman via a hook, making it possible to replace Git’s expensive refreshing process with a long-running daemon which tracks the filesystem state more directly.

But setting up this hook and installing a third-party tool can be cumbersome. In Git 2.37, this functionality is built into Git itself on Windows and macOS, removing the need to install an external tool and configure the hook.

You can enable this for your repository by enabling the core.fsmonitor config setting.

$ git config core.fsmonitor true

After setting up the config, an initial git status will take the normal amount of time, but subsequent commands will take advantage of the monitored data and run significantly faster.

The full implementation is impossible to describe completely in this post. Interested readers can follow along later this week with a blog post written by Jeff Hostetler for more information. We’ll be sure to add a link here when that post is published.

[source, source, source, source]

The sparse index is ready for wide use

We previously announced Git’s sparse index feature, which helps speed up Git commands when using the sparse-checkout feature in a large repository.

In case you haven’t seen our earlier post, here’s a brief refresher. Often when working in an extremely large repository, you don’t need the entire contents of your repository present locally in order to contribute. For example, if your company uses a single monorepo, you may only be interested in the parts of that repository that correspond to the handful of products you work on.

Partial clones make it possible for Git to only download the objects that you care about. The sparse index is an equally important component of the equation. The sparse index makes it possible for the index (a key data structure which tracks the content of your next commit, which files have been modified, and more) to only keep track of the parts of your repository that you’re interested in.

When we originally announced the sparse index, we explained how different Git subcommands would have to be updated individually to take advantage of the sparse index. With Git 2.37.0, all of those integrations are now included in the core Git project and available to all users.

In this release, the final integrations were for git show, git sparse-checkout, and git stash. In particular, git stash has the largest performance boost of all of the integrations so far because of how the command reads and writes indexes multiple times in a single process, achieving a near 80% speed-up in certain cases (though see this thread for all of the details).

[source, source, source]

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.37, or any previous version in the Git repository.


Now that we have looked at some of the bigger features in detail, let’s turn to a handful of smaller topics from this release.

  • Speaking of sparse checkouts, this release deprecates the non---cone-mode style of sparse checkout definitions.

    For the uninitiated, the git sparse-checkout command supports two kinds of patterns which dictate which parts of your repository should be checked out: “cone” mode, and “non-cone” mode. The latter, which allows specifying individual files with a .gitignore-style syntax, can be confusing to use correctly, and has performance problems (namely that in the worst case all patterns must try to be matched with all files, leading to slow-downs). Most importantly, it is incompatible with the sparse-index, which brings the performance enhancements of using a sparse checkout to all of the Git commands you’re familiar with.

    For these reasons (and more!), the non---cone mode style of patterns is discouraged, and users are instead encouraged to use --cone mode.


  • In our highlights from the last Git release, we talked about more flexible fsync configuration, which made it possible to more precisely define what files Git would explicitly synchronize with fsync() and what strategy it would use to do that synchronization.

    This release brings a new strategy to the list supported by core.fsyncMethod: “batch”, which can provide significant speed-ups on supported filesystems when writing many individual files. This new mode works by staging many updates to the disk’s writeback cache before preforming a single fsync() causing the disk to flush its writeback cache. Files are then atomically moved into place, guaranteeing that they are fsync()-durable by the time they enter the object directory.

    For now, this mode only supports batching loose object writes, and will only be enabled when core.fsync includes the loose-objects value. On a synthetic test of adding 500 files to the repository with git add (each resulting in a new loose object), the new batch mode imposes only a modest penalty over not fsyncing at all.

    On Linux, for example, adding 500 files takes .06 seconds without any fsync() calls, 1.88 seconds with an fsync() after each loose object write, and only .15 seconds with the new batched fsync(). Other platforms display similar speed-ups, with a notable example being Windows, where the numbers are .35 seconds, 11.18 seconds, and just .41 seconds, respectively.


  • If you’ve ever wondered, “what’s changed in my repository since yesterday?”, one way you can figure that out is with the --since option, which is supported by all standard revision-walking commands, like log and rev-list.

    This option works by starting with the specified commits, and walking recursively along each commit’s parents, stopping the traversal as soon as it encounters a commit older than the --since date. But in occasional circumstances (particularly when there is) clock skew this can produce confusing results.

    For example, suppose you have three commits, C1, C2, and C3, where C2 is the parent of C3, and C1 is the parent of C2. If both C1 and C3 were written in the last hour, but C2 is a day old (perhaps because the committer’s clock is running slow), then a traversal with --since=1.hour.ago will only show C3, since seeing C2 causes Git to halt its traversal.

    If you expect your repository’s history has some amount of clock skew, then you can use --since-as-filter in place of --since, which only prints commits newer than the specified date, but does not halt its traversal upon seeing an older one.


  • If you work with partial clones, and have a variety of different Git remotes, it can be confusing to remember which partial clone filter is attached to which remote.

    Even in a simple example, trying to remember what object filter was used to clone your repository requires this incantation:

    $ git config remote.origin.partialCloneFilter

    In Git 2.37, you can now access this information much more readily behind the -v flag of git remote, like so:

    $ git remote -v
    origin    [email protected]:git/git.git (fetch) [tree:0]
    origin    [email protected]:git/git.git (push)

    Here, you can easily see between the square-brackets that the remote origin uses a tree:0 filter.

    This work was contributed by Abhradeep Chakraborty, a Google Summer of Code student, who is one of three students participating this year and working on Git.


  • Speaking of remote configuration, Git 2.37 ships with support for warning or exiting when it encounters plain-text credentials stored in your configuration with the new transfer.credentialsInUrl setting.

    Storing credentials in plain-text in your repository’s configuration is discouraged, since it forces you to ensure you have appropriately restrictive permissions on the configuration file. Aside from storing the data unencrypted at rest, Git often passes the full URL (including credentials) to other programs, exposing them on systems where other processes have access to arguments list of sensitive processes. In most cases, it is encouraged to use Git’s credential mechanism, or tools like GCM.

    This new setting allows Git to either ignore or halt execution when it sees one of these credentials by setting the transfer.credentialsInUrl to “warn” or “die” respectively. The default, “allow”, does nothing.

    [source, source]

  • If you’ve ever used git add -p to stage the contents of your working tree incrementally, then you may be familiar with git add‘s “interactive mode”, or git add -i, of which git add -p is a sub-mode.

    In addition to “patch” mode, git add -i supports “status”, “update”, “revert”, “add untracked”, “patch”, and “diff”. Until recently, this mode of git add -i was actually written in Perl. This command has been the most recent subject of a long-running effort to port Git commands written in Perl into C. This makes it possible to use Git’s libraries without spawning sub-processes, which can be prohibitively expensive on certain platforms.

    The C reimplementation of git add -i has shipped in releases of Git as early as v2.25.0. In more recent versions, this reimplementation has been in “testing” mode behind an opt-in configuration. Git 2.37 promotes the C reimplementation by default, so Windows users should notice a speed-up when using git add -p.

    [source, source, source, source, source, source, source]

  • Last but not least, there is a lot of exciting work going on for Git developers, too, like improving the localization workflow, improving CI output with GitHub Actions, and reducing memory leaks in internal APIs.

    If you’re interested in contributing to Git, now is a more exciting time than ever to start. Check out this guide for some tips on getting started.

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.37 or any previous version in the Git repository.

Improving Git push times through faster server side hooks

Post Syndicated from Carlos Martín Nieto original https://github.blog/2022-04-21-improving-git-push-times-through-faster-server-side-hooks/

At GitHub, we relentlessly pursue performance. Join me now for the tale of how we dropped a P99 time by 95% on code that runs for every single Git push operation.

Pre-receive hook execution time

Every time you push to GitHub, we run a set of checks to validate your push before accepting it. If you ever tried to push an object larger than 100MB, you are already familiar with them, as these pre-receive hooks contain that logic. Similarly, they do other checks, such as verifying that LFS objects have been successfully uploaded. These hooks help keep our servers healthy and improve the user experience.

We recently rewrote these hooks from their original Ruby implementation into Go. This rewrite was something we had in mind for a while, but what really sold us on the effort was the potential performance improvement.

Today, we’ll talk about the history of these hooks, how we discovered that the performance was problematic, and how we went about safely replacing them.

How did we get here?

We created the first hook in 2013 to warn users that a repository was renamed. The only action was a database check for a previous name and to send a warning to the user to update their remote URL. At the time, almost all of GitHub was part of one Ruby on Rails application, so it was the logical choice for hooks as well. As time passed, more and more functionality was added to the hooks, requiring additional configuration, exception reporting, and logging.

This meant that hooks imported the same dependencies as the Ruby application. Over time, the number of dependencies, and therefore startup time, only increased. In a Rails application, these dependencies are loaded only once at startup time, and then each request has them available, making the startup time not important for the user experience. However, these hooks are run as subprocesses underneath the Git executable, so they are loaded for each request, making the startup time critical to performance. When we investigated, loading these dependencies took a rather long time. Hooks took about 880 milliseconds to execute on average, and almost all of that time was spent loading dependencies. In addition, there are two sets of hooks: one with the new data under quarantine and a second set once the data is available in the repository. Especially with this double execution, this startup time significantly affects each push. An empty push could take more than two seconds, which was unacceptable.

Why rewrite?

Since the performance issues were related to startup time, we had a few options. We could reduce the number of dependencies, we could change the architecture so that hooks only started up once, or we could rewrite the hooks to run independently of the monolith. Rewrites and changing the architecture carry risk, so we tried the simplest alternative first.

Loading fewer dependencies while staying within the Rails monolith proved quite tricky. There were a lot of dependencies (more than 450 gems leading to over 1,000 require calls), and they were all quite tangled up in the app’s configuration, because they were not designed to be used outside of the GitHub Rails monolith. However, careful use of the debugger and strace revealed a few outliers that we could avoid loading when running the hooks. This removal dropped 350-400 milliseconds from the startup time.

While this was already a decent improvement for a small tweak, the startup time was still quite slow, and we weren’t satisfied yet. Additionally, new dependencies are frequently added to the Rails application, which means that the startup time would creep up again over time even if our hook code did not change.

How did we rewrite?

We could not ignore the configuration from the Rails app as that is how we know how to connect to the database, send stats, etc. Some of that could be duplicated at the risk of having two parallel configuration paths that would almost certainly end up diverging.

To pass configuration along that only the app knows about, we were able to use an existing mechanism, which was already in use to pass along information, such as the name of a repository and whether it is over its quota. This comes alongside other information necessary to perform updates so it gets called for every push, and adding some more data there adds very little overhead. We identified the information necessary to perform the checks and added this information.

For the past few years, the Git Systems Team has been extracting more and more of our service code from the Rails monolith and rewriting it as a dedicated service written in Go. Based on this experience with Go, and given what we learned about the Ruby hooks, moving the hooks into this Go service seemed like a natural fit as well. Just extracting the hooks to run independently of the Rails apps would have removed most of the boot time, but Go gives us the last few milliseconds and lets us make them part of the service in which the backend code increasingly lives. We expected such a significant rewrite to be worth the risk, because afterwards the hooks should be much faster.

Further, as we commonly do for high-impact changes, we put these rewritten hooks behind a feature flag. This gave us the ability to enable them for individual repositories or groups of them. We started with a few internal GitHub repositories to confirm the effect in production.

The results were so impressive that we had to double check that the hooks were still running. It was hard to distinguish between really fast hooks and completely disabled hooks. The median time was now 10ms, compared to roughly 880ms when we started the project. This made pushes noticeably faster for everyone. We even got unprompted questions about whether pushing had become faster after someone noticed it on their own.

Lessons learned

This is a project we had in mind for a long time. We had wanted to rewrite these hooks outside of the monolith to separate our area of responsibility better. However, merely having a better architecture often isn’t enough to make something a business priority. By tying the change to its impact on users we could prioritize this work. We came away with the dual benefits of a much better user experience and an architectural improvement.

This change has now been live on github.com for a couple of months, and has been shipped in GHES 3.4, so everyone now saves some time pushing to their GitHub repositories.


Highlights from Git 2.36

Post Syndicated from Taylor Blau original https://github.blog/2022-04-18-highlights-from-git-2-36/

The open source Git project just released Git 2.36, with features and bug fixes from over 96 contributors, 26 of them new. We last caught up with you on the latest in Git back when 2.35 was released.

To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

Review merge conflict resolution with –remerge-diff

Returning readers may remember our coverage of merge ort, the from-scratch rewrite of Git’s recursive merge engine.

This release brings another new feature powered by ort, which is the --remerge-diff option. To explain what --remerge-diff is and why you might be excited about it, let’s take a step back and talk about git show.

When given a commit git show will print out that commit’s log message as well as its diff. But it has slightly different behavior when given a merge commit, especially one that had merge conflicts. If you’ve ever passed a conflicted merge to git show, you might be familiar with this output:

If you look closely, you might notice that there are actually two columns of diff markers (the + and - characters to indicate lines added and removed). These come from the output of git diff-tree -cc, which is showing us the diff between each parent and the post-image of the given commit simultaneously.

In this particular example, the conflict occurs because one side has an extra argument in the dwim_ref() call, and the other includes an updated comment to use reflect renaming a variable from sha1 to oid. The left-most markers show the latter resolution, and the right-most markers show the former.

But this output can be understandably difficult to interpret. In Git 2.36, --remerge-diff takes a different approach. Instead of showing you the diffs between the merge resolution and each parent simultaneously, --remerge-diff shows you the diff between the file with merge conflicts, and the resolution.

The above shows the output of git show with --remerge-diff on the same conflicted merge commit as before. Here, we can see the diff3-style conflicts (shown in red, since the merge commit removes the conflict markers during resolution) along with the resolution. By more clearly indicating which parts of the conflict were left as-is, we can more easily see how the given commit resolved its conflicts, instead of trying to weave-together the simultaneous diff output from git diff-tree -cc.

Reconstructing these merges is made possible using ort. The ort engine is significantly faster than its predecessor, recursive, and can reconstruct all conflicted merge in linux.git in about 3 seconds (as compared to diff-tree -cc, which takes more than 30 seconds to perform the same operation

Give it a whirl in your Git repositories on 2.36 by running git show --remerge-diff on some merge conflicts in your history.


More flexible fsync configuration

If you have ever looked around in your repository’s .git directory, you’ll notice a variety of files: objects, references, reflogs, packfiles, configuration, and the like. Git writes these objects to keep track of the state of your repository, creating new object files when you make new commits, update references, repack your repository, and so on.

Most likely, you haven’t had to think too hard about how these files are written and updated. If you’re curious about these details, then read on! When any application writes changes to your filesystem, those changes aren’t immediately persisted, since writing to the external storage medium is significantly slower than updating your filesystem’s in-memory caches.

Instead, changes are staged in memory and periodically flushed to disk at which point the changes are (usually, though disks and controllers can have their own write caches, too) written to the physical storage medium.

Aside from following standard best-practices (like writing new files to a temporary location and then atomically moving them into place), Git has had a somewhat limited set of configuration available to tune how and when it calls fsync, mostly limited to core.fsyncObjectFiles, which, when set, causes Git to call fsync() when creating new loose object files. (Git has had non-configurable fsync() calls scattered throughout its codebase for things like writing packfiles, the commit-graph, multi-pack index, and so on).

Git 2.36 introduces a significantly more flexible set of configuration options to tune how and when Git will explicitly fsync lots of different kinds of files, not just if it fsyncs loose objects.

At the heart of this new change are two new configuration variables:
core.fsync and core.fsyncMethod. The former lets you pick a comma-separated list of which parts of Git’s internal data structures you want to be explicitly flushed after writing. The full list can be found in the documentation, but you can pick from things like pack (to fsync files in $GIT_DIR/objects/pack) or loose-object (to fsync loose objects), to reference (to fsync references in the $GIT_DIR/refs directory). There are also aggregate options like objects (which implies both loose-object and pack), along with others like derived-metadata, committed, and all.

You can also tune how Git ensures the durability of components included in your core.fsync configuration by setting the core.fsyncMethod to either fsync (which calls fsync(), or issues a special fcntl() on macOS), or writeout-only, which schedules the written data for flushing, though does not guarantee that metadata like directory entries are updated as part of the flush operation.

Most users won’t need to change these defaults. But for server operators who have many Git repositories living on hardware that may suddenly lose power, having these new knobs to tune will provide new opportunities to enhance the durability of written data.

[source, source, source]

Stricter repository ownership checks

If you haven’t seen our blog post from last week announcing the security patches for versions 2.35 and earlier, let me give you a brief recap.

Beginning in Git 2.35.2, Git changed its default behavior to prevent you from executing git commands in a repository owned by a different user than the current one. This is designed to prevent git invocations from unintentionally executing commands which the repository owner configured.

You can bypass this check by setting the new safe.directory configuration to include trusted repositories owned by other users. If you can’t upgrade immediately, our blog post outlines some steps you can take to mitigate your risk, though the safest thing you can do is upgrade to the latest version of Git.

Since publishing that blog post, the safe.directory option now interprets the value * to consider all Git repositories as safe, regardless of their owner. You can set this in your --global config to opt-out of the new behavior in situations where it makes sense.



Now that we have looked at some of the bigger features in detail, let’s turn to a handful of smaller topics from this release.

  • If you’ve ever spent time poking around in the internals of one of your Git repositories, you may have come across the git cat-file command. Reminiscent of cat, this command is useful for printing out the raw contents of Git objects in your repository. cat-file has a handful of other modes that go beyond just printing the contents of an object. Instead of printing out one object at a time, it can accept a stream of objects (via stdin) when passed the --batch or --batch-check command-line arguments. These two similarly-named options have slightly different outputs: --batch instructs cat-file to just print out each object’s contents, while --batch-check is used to print out information about the object itself, like its type and size1.

    But what if you want to dynamically switch between the two? Before, the only way was to run two separate copies of the cat-file command in the same repository, one in --batch mode and the other in --batch-check mode. In Git 2.36, you no longer need to do this. You can instead run a single git cat-file command with the new --batch-command mode. This mode lets you ask for the type of output you want for each object. Its input looks either like contents <object>, or info <object>, which correspond to the output you’d get from --batch, or --batch-check, respectively.

    For server operators who may have long-running cat-file commands intended to service multiple requests, --batch-command accepts a new flush command, which flushes the output buffer upon receipt.

    [source, source]

  • Speaking of Git internals, if you’ve ever needed to script around the contents of a tree object in your repository, then there’s no doubt that git ls-tree has come in handy.

    If you aren’t familiar with ls-tree, the gist is that it allows you to list the contents of a tree objects, optionally recursing through nested sub-trees. Its output looks something like this:

    $ git ls-tree HEAD -- builtin/
    100644 blob 3ffb86a43384f21cad4fdcc0d8549e37dba12227  builtin/add.c
    100644 blob 0f4111bafa0b0810ae29903509a0af74073013ff  builtin/am.c
    100644 blob 58ff977a2314e2878ee0c7d3bcd9874b71bfdeef  builtin/annotate.c
    100644 blob 3f099b960565ff2944209ba514ea7274dad852f5  builtin/apply.c
    100644 blob 7176b041b6d85b5760c91f94fcdde551a38d147f  builtin/archive.c

    Previously, the customizability of ls-tree‘s output was somewhat limited. You could restrict the output to just the filenames with --name-only, print absolute paths with --full-name, or abbreviate the object IDs with --abbrev, but that was about it.

    In Git 2.36, you have a lot more control about how ls-tree‘s should look. There’s a new --object-only option to complement --name-only. But if you really want to customize its output, the new --format option is your best bet. You can select from any combination and order of the each entry’s mode, type, name, and size.

    Here’s a fun example of where something like this might come in handy. Let’s say you’re interested in the distribution of file-sizes of blobs in your repository. Before, to get a list of object sizes, you would have had to do either:

    $ git ls-tree ... | awk '{ print $3 }' | git cat-file --batch-check='%(objectsize)'

    or (ab)use the --long format and pull out the file sizes of blobs:

    $ git ls-tree -l | awk '{ print $4 }'

    but now you can ask for just the file sizes directly, making it much more convenient to script around them:

    $ dist () {
     ruby -lne 'print 10 ** (Math.log10($_.to_i).ceil)' | sort -n | uniq -c
    $ git ls-tree --format='%(objectsize)' HEAD:builtin/ | dist
      8 1000
     59 10000
     53 100000
      2 1000000

    …showing us that we have 8 files that are between 1-10 KiB in size, 59 files between 10-100 KiB, 53 files between 100 KiB and 1 MiB, and 2 files larger than 1 MiB.

    [source, source, source, source]

  • If you’ve ever tried to track down a bug using Git, then you’re familiar with the git bisect command. If you haven’t, here’s a quick primer. git bisect takes two revisions of your repository, one corresponding to a known “good” state, and another corresponding to some broken state. The idea is then to run a binary search between those two points in history to find the first commit which transitioned the good state to the broken state.

    If you aren’t a frequent bisect user, you may not have heard of the git bisect run command. Instead of requiring you to classify whether each point along the search is good or bad, you can supply a script which Git will execute for you, using its exit status to classify each revision for you.

    This can be useful when trying to figure out which commit broke the build, which you can do by running:

    $ git bisect start <bad> <good>
    $ git bisect run make

    which will run make along the binary search between <bad> and <good>, outputting the first commit which broke compilation.

    But what about automating more complicated tests? It can often be useful to write a one-off shell script which runs some test for you, and then hand that off to git bisect. Here, you might do something like:

    $ vi test.sh
    # type type type
    $ git bisect run test.sh

    See the problem? We forgot to mark test.sh as executable! In previous versions of Git, git bisect would incorrectly carry on the search, classifying each revision as broken. In Git 2.36, git bisect will detect that you forgot to mark the script as executable, and halt the search early.


  • When you run git fetch, your Git client communicates with the remote to carry out a process called negotiation to determine which objects the server needs to send to complete your request. Roughly speaking, your client and the server mutually advertise what they have at the tips of each reference, then your client lists which objects it wants, and the server sends back all objects between the requested objects and the ones you already have.

    This works well because Git always expects to maintain closure over reachable objects2, meaning that if you have some reachable object in your repository, you also have all of its ancestors.

    In other words, it’s fine for the Git server to omit objects you already have, since the combination of the objects it sends along with the ones you already have should be sufficient to assemble the branches and tags your client asked for.

    But if your repository is corrupt, then you may need the server to send you objects which are reachable from ones you already have, in which case it isn’t good enough for the server to just send you the objects between what you have and want. In the past, getting into a situation like this may have led you to re-clone your entire repository.

    Git 2.36 ships with a new option to git fetch which makes it easier to recover from certain kinds of repository corruption. By passing the new --refetch option, you can instruct git fetch to fetch all objects from the remote, regardless of which objects you already have, which is useful when the contents of your objects directory are suspect.


  • Returning readers may remember our earlier discussions about the sparse index and sparse checkouts, which make it possible to only have part of your repository checked out at a time.

    Over the last handful of releases, more and more commands have become compatible with the sparse index. This release is no exception, with four more Git commands joining the pack. Git 2.36 brings sparse index support to git clean, git checkout-index, git update-index, and git read-tree.

    If you haven’t used these commands, there’s no need to worry: adding support to these plumbing commands is designed to lay the ground work for building a sparse index-aware git stash. In the meantime, sparse index support already exists in the commands that you are most likely already familiar with, like git status, git commit, git checkout, and more.

    As an added bonus, git sparse-checkout (which is used to enable the sparse checkout feature and dictate which parts of your repository you want checked out) gained support for the command-line completion Git ships in its contrib directory.

    [source, source, source]

  • Returning readers may remember our previous coverage on partial clones, a relatively new feature in Git which allows you to initialize your clones by downloading just some of the objects in your repository.

    If you used this feature in the past with git clone‘s --recurse-submodules flag, the partial clone filter was only applied to the top-level repository, cloning all of the objects in the submodules.

    This has been fixed in the latest release, where the --filter specification you use in your top-level clone is applied recursively to any submodules your repository might contain, too.

    [source, source]

  • While we’re talking about partial clones, now is a good time to mention partial bundles, which are new in Git 2.36. You may not have heard of Git bundles, which is a different way of transferring around parts of your repository.

    Roughly speaking, a bundle combines the data in a packfile, along with a list of references that are contained in the bundle. This allows you to capture information about the state of your repository into a single file that you can share. For example, the Git project uses bundles to share embargoed security releases with various Linux distribution maintainers. This allows us to send all of the objects which comprise a new release, along with the tags that point at them in a single file over email.

    In previous releases of Git, it was impossible to prepare a filtered bundle which you could apply to a partial clone. In Git 2.36, you can now prepare filtered bundles, whose contents are unpacked as if they arrived during a partial clone3. You can’t yet initialize a new clone from a partial bundle, but you can use it to fetch objects into a bare repository:

    $ git bundle create --filter=blob:none ../partial.bundle v2.36.0
    $ cd ..
    $ git init --bare example.repo
    $ git fetch --filter=blob:none ../partial.bundle 'refs/tags/*:refs/tags/*'
    [ ... ]
    From ../example.bundle
    * [new tag]             v2.36.0 -> v2.36.0

    [source, source]

  • Lastly, let’s discuss a bug fix concerning Git’s multi-pack reachability bitmaps. If you have started to use this new feature, you may have noticed a handful of new files in your .git/objects/pack directory:

    $ ls .git/objects/pack/multi-pack-index*

    In order, these are: the multi-pack index (MIDX) itself, the reachability bitmap data, and the reverse-index which tells Git which bits correspond to what objects in your repository.

    These are all associated back to the MIDX via the MIDX’s checksum, which is how Git knows that the three belong together. This release fixes a bug where the .rev file could fall out-of-sync with the MIDX and its bitmap, leading Git to report incorrect results when using a multi-pack bitmap. This happens when changing the object order of the MIDX without changing the set of objects tracked by the MIDX.

    If your .rev file has a modification time that is significantly older than the MIDX and .bitmap, you may have been bitten by this bug4. Luckily this bug can be resolved by dropping and regenerating your bitmaps5. To prevent a MIDX bitmap and its .rev file from falling out of sync again, the contents of the .rev are now included in the MIDX itself, forcing the MIDX’s checksum to change whenever the object order changes.


The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.36, or any previous version in the Git repository.

  1. You can ask for other attributes, too, like %(objectsize:disk) which shows how many bytes it takes Git to store the object on disk (which can be smaller than %(objectsize) if, for example, the object is stored as a delta against some other, similar object). 
  2. This isn’t quite true, because of things like shallow and partial clones, along with grafts, but the assumption is good enough for our purposes here. What matters it that outside of scenarios where we expect to be missing objects, the only time we don’t have a reachability closure is when the repository itself is corrupt. 
  3. In Git parlance, this would be a packfile from a promisor remote
  4. This isn’t entirely fool-proof, since it’s possible way of detecting that this bug occurred, since it’s possible your bitmaps were rewritten after first falling out-of-sync. When this happens, it’s possible that the corrupt bitmaps are propagated forward when generating new bitmaps. You can use git rev-list --test-bitmap HEAD to check if your bitmaps are OK. 
  5. By first running rm -f .git/objects/pack/multi-pack-index*, and then
    git repack -d --write-midx --write-bitmap-index

Git Credential Manager: authentication for everyone

Post Syndicated from Matthew John Cheetham original https://github.blog/2022-04-07-git-credential-manager-authentication-for-everyone/

Universal Git Authentication

“Authentication is hard. Hard to debug, hard to test, hard to get right.” – Me

These words were true when I wrote them back in July 2020, and they’re still true today. The goal of Git Credential Manager (GCM) is to make the task of authenticating to your remote Git repositories easy and secure, no matter where your code is stored or how you choose to work. In short, GCM wants to be Git’s universal authentication experience.

In my last blog post, I talked about the risk of proliferating “universal standards” and how introducing Git Credential Manager Core (GCM Core) would mean yet another credential helper in the wild. I’m therefore pleased to say that we’ve managed to successfully replace both GCM for Windows and GCM for Mac and Linux with the new GCM! The source code of the older projects has been archived, and they are no longer shipped with distributions like Git for Windows!

In order to celebrate and reflect this successful unification, we decided to drop the “Core” moniker from the project’s name to become simply Git Credential Manager or GCM for short.

Git Credential Manager

If you have followed the development of GCM closely, you might have also noticed we have a new home on GitHub in our own organization, github.com/GitCredentialManager!

We felt being homed under github.com/microsoft or github.com/github didn’t quite represent the ethos of GCM as an open, universal and agnostic project. All existing issues and pull requests were migrated, and we continue to welcome everyone to contribute to the project.

GCM Home

Interacting with HTTP remotes without the help of a credential helper like GCM is becoming more difficult with the removal of username/password authentication at GitHub and Bitbucket. Using GCM makes it easy, and with exciting developments such as using GitHub Mobile for two-factor authentication and OAuth device code flow support, we are making authentication more seamless.

Hello, Linux!

In the quest to become a universal solution for Git authentication, we’ve worked hard on getting GCM to work well on various Linux distributions, with a primary focus on Debian-based distributions.

Today we have Debian packages available to download from our GitHub releases page, as well as tarballs for other distributions (64-bit Intel only). Being built on the .NET platform means there should be a reduced effort to build and run anywhere the .NET runtime runs. Over time, we hope to expand our support matrix of distributions and CPU architectures (by adding ARM64 support, for example).

Due to the broad and varied nature of Linux distributions, it’s important that GCM offers many different credential storage options. In addition to GPG encrypted files, we added support for the Secret Service API via libsecret (also see the GNOME Keyring), which provides a similar experience to what we provide today in GCM on Windows and macOS.

Windows Subsystem for Linux

In addition to Linux distributions, we also have special support for using GCM with Windows Subsystem for Linux (WSL). Using GCM with WSL means that all your WSL installations can share Git credentials with each other and the Windows host, enabling you to easily mix and match your development environments.

Easily mix and match your development environments

You can read more about using GCM inside of your WSL installations here.

Hello, GitLab

Being universal doesn’t just mean we want to run in more places, but also that we can help more users with whatever Git hosting service they choose to use. We are very lucky to have such an engaged community that is constantly working to make GCM better for everyone.

On that note, I am thrilled to share that through a community contribution, GCM now has support for GitLab.  Welcome to the family!

GCM for everyone

Look Ma, no terminals!

We love the terminal and so does GCM. However, we know that not everyone feels comfortable typing in commands and responding to prompts via the keyboard. Also, many popular tools and IDEs that offer Git integration do so by shelling out to the git executable, which means GCM may be called upon to perform authentication from a GUI app where there is no terminal(!)

GCM has always offered full graphical authentication prompts on Windows, but thanks to our adoption of the Avalonia project that provides a cross-platform .NET XAML framework, we can now present graphical prompts on macOS and Linux.

GCM continues to support terminal prompts as a first-class option for all prompts.

GCM continues to support terminal prompts as a first-class option for all prompts. We detect environments where there is no GUI (such as when connected over SSH without display forwarding) and instead present the equivalent text-based prompts. You can also manually disable the GUI prompts if you wish.

Securing the software supply chain

Keeping your source code secure is a critical step in maintaining trust in software, whether that be keeping commercially sensitive source code away from prying eyes or protecting against malicious actors making changes in both closed and open source projects that underpin much of the modern world.

In 2020, an extensive cyberattack was exposed that impacted parts of the US federal government as well as several major software companies. The US president’s recent executive order in response to this cyberattack brings into focus the importance of mechanisms such as multi-factor authentication, conditional access policies, and generally securing the software supply chain.

Store ALL the credentials

Git Credential Manager creates and stores credentials to access Git repositories on a host of platforms. We hold in the highest regard the need to keep your credentials and access secure. That’s why we always keep your credentials stored using industry standard encryption and storage APIs.

GCM makes use of the Windows Credential Manager on Windows and the login keychain on macOS.

In addition to these existing mechanisms, we also support several alternatives across supported platforms, giving you the choice of how and where you wish to store your generated credentials (such as GPG-encrypted credential files).

Store all your credentials

GCM can now also use Git’s git-credential-cache helper that is commonly built and available in many Git distributions. This is a great option for cloud shells or ephemeral environments when you don’t want to persist credentials permanently to disk but still want to avoid a prompt for every git fetch or git push.

Modern windows authentication (experimental)

Another way to keep your credentials safe at rest is with hardware-level support through technologies like the Trusted Platform Module (TPM) or Secure Enclave. Additionally, enterprises wishing to make sure your device or credentials have not been compromised may want to enforce conditional access policies.

Integrating with these kinds of security modules or enforcing policies can be tricky and is platform-dependent. It’s often easier for applications to hand over responsibility for the credential acquisition, storage, and policy
enforcement to an authentication broker.

An authentication broker performs credential negotiation on behalf of an app, simplifying many of these problems, and often comes with the added benefit of deeper integration with operating system features such as biometrics.

Authentication broker diagram

I’m happy to announce that GCM has gained experimental support for brokered authentication (Windows-only at the moment)!

On Windows, the authentication broker is a component that was first introduced in Windows 10 and is known as the Web Account Manager (WAM). WAM enables apps like GCM to support modern authentication experiences such as Windows Hello and will apply conditional access policies set by your work or school.

Please note that support for the Windows broker is currently experimental and limited to authentication of Microsoft work and school accounts against Azure DevOps.

Click here to read more about GCM and WAM, including how to opt-in and current known issues.

Even more improvements

GCM has been a hive of activity in the past 18 months, with too many new features and improvements to talk about in detail! Here’s a quick rundown of additional updates since our July 2020 post:

  • Automatic on-premises/self-hosted instance detection
  • GitHub Enterprise Server and GitHub AE support
  • Shared Microsoft Identity token caches with other developer tools
  • Improved network proxy support
  • Custom TLS/SSL root certificate support
  • Admin-less Windows installer
  • Improved command line handling and output
  • Enterprise default setting support on Windows
  • Multi-user support
  • Better diagnostics

Thank you!

The GCM team would also like to personally thank all the people who have made contributions, both large and small, to the project:

@vtbassmatt, @kyle-rader, @mminns, @ldennington, @hickford, @vdye, @AlexanderLanin, @derrickstolee, @NN, @johnemau, @karlhorky, @garvit-joshi, @jeschu1, @WormJim, @nimatt, @parasychic, @cjsimon, @czipperz, @jamill, @jessehouwing, @shegox, @dscho, @dmodena, @geirivarjerstad, @jrbriggs, @Molkree, @4brunu, @julescubtree, @kzu, @sivaraam, @mastercoms, @nightowlengineer

Future work

While we’ve made a great deal of progress toward our universal experience goal, we’re not slowing down anytime soon; we’re still full steam ahead with GCM!

Our focus for the next period will be on iterating and improving our authentication broker support, providing stronger protection of credentials, and looking to increase performance and compatibility with more environments and uses.

Highlights from Git 2.35

Post Syndicated from Taylor Blau original https://github.blog/2022-01-24-highlights-from-git-2-35/

The open source Git project just released Git 2.35, with features and bug fixes from over 93 contributors, 35 of them new. We last caught up with you on the latest in Git back when 2.34 was released. To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

  • When working on a complicated change, it can be useful to temporarily discard parts of your work in order to deal with them separately. To do this, we use the git stash tool, which stores away any changes to files tracked in your Git repository.

    Using git stash this way makes it really easy to store all accumulated changes for later use. But what if you only want to store part of your changes in the stash? You could use git stash -p and interactively select hunks to stash or keep. But what if you already did that via an earlier git add -p? Perhaps when you started, you thought you were ready to commit something, but by the time you finished staging everything, you realized that you actually needed to stash it all away and work on something else.

    git stash‘s new --staged mode makes it easy to stash away what you already have in the staging area, and nothing else. You can think of it like git commit (which only writes staged changes), but instead of creating a new commit, it writes a new entry to the stash. Then, when you’re ready, you can recover your changes (with git stash pop) and keep working.


  • git log has a rich set of --format options that you can use to customize the output of git log. These can be handy when sprucing up your terminal, but they are especially useful for making it easier to script around the output of git log.

    In our blog post covering Git 2.33, we talked about a new --format specifier called %(describe). This made it possible to include the output of git describe alongside the output of git log. When it was first released, you could pass additional options down through the %(describe) specifier, like matching or excluding certain tags by writing --format=%(describe:match=<foo>,exclude=<bar>).

    In 2.35, Git includes a couple of new ways to tweak the output of git describe. You can now control whether to use lightweight tags, and how many hexadecimal characters to use when abbreviating an object identifier.

    You can try these out with %(describe:tags=<bool>) and %(describe:abbrev=<n>), respectively. Here’s a goofy example that gives me the git describe output for the last 8 commits in my copy of git.git, using only non-release-candidate tags, and uses 13 characters to abbreviate their hashes:

    $ git log -8 --format='%(describe:exclude=*-rc*,abbrev=13)'

    Which is much cleaner than this alternative way to combine git log and git describe:

    $ git log -8 --format='%H' | xargs git describe --exclude='*-rc*' --abbrev=13


  • In our last post, we talked about SSH signing: a new feature in Git that allows you to use the SSH key you likely already have in order to sign certain kinds of objects in Git.

    This release includes a couple of new additions to SSH signing. Suppose you use SSH keys to sign objects in a project you work on. To track which SSH keys you trust, you use the allowed signers file to store the identities and public keys of signers you trust.

    Now suppose that one of your collaborators rotates their key. What do you do? You could update their entry in the allowed signers file to point at their new key, but that would make it impossible to validate objects signed with the older key. You could store both keys, but that would mean that you would accept new objects signed with the old key.

    Git 2.35 lets you take advantage of OpenSSH’s valid-before and valid-after directives by making sure that the object you’re verifying was signed using a signature that was valid when it was created. This allows individuals to rotate their SSH keys by keeping track of when each key was valid without invalidating any objects previously signed using an older key.

    Git 2.35 also supports new key types in the user.signingKey configuration when you include the key verbatim (instead of storing the path of a file that contains the signing key). Previously, the rule for interpreting user.signingKey was to treat its value as a literal SSH key if it began with “ssh-“, and to treat it as filepath otherwise. You can now specify literal SSH keys with keytypes that don’t begin with “ssh-” (like ECDSA keys).

    [source, source]

  • If you’ve ever dealt with a merge conflict, you know that accurately resolving conflicts takes some careful thinking. You may not have heard of Git’s merge.conflictStyle setting, which makes resolving conflicts just a little bit easier.

    The default value for this configuration is “merge”, which produces the merge conflict markers that you are likely familiar with. But there is a different mode, “diff3”, which shows the merge base in addition to the changes on either side.

    Git 2.35 introduces a new mode, “zdiff3”, which zealously moves any lines in common at the beginning or end of a conflict outside of the conflicted area, which makes the conflict you have to resolve a little bit smaller.

    For example, say I have a list with a placeholder comment, and I merge two branches that each add different content to fill in the placeholder. The usual merge conflict might look something like this:

    <<<<<<< HEAD
    >>>>>>> side

    Trying again with diff3-style conflict markers shows me the merge base (revealing a comment that I didn’t know was previously there) along with the full contents of either side, like so:

    <<<<<<< HEAD
    ||||||| 60c6bd0
    # add more here
    >>>>>>> side

    The above gives us more detail, but notice that both sides add “foo” and, “bar” at the beginning and “baz” at the end. Trying one last time with zdiff3-style conflict markers moves the “foo” and “bar” outside of the conflicted region altogether. The result is both more accurate (since it includes the merge base) and more concise (since it handles redundant parts of the conflict for us).

    <<<<<<< HEAD
    ||||||| 60c6bd0
    # add more here
    >>>>>>> side


  • You may (or may not!) know that Git supports a handful of different algorithms for generating a diff. The usual algorithm (and the one you are likely already familiar with) is the Myers diff algorithm. Another is the --patience diff algorithm and its cousin --histogram. These can often lead to more human-readable diffs (for example, by avoiding a common issue where adding a new function starts the diff by adding a closing brace to the function immediately preceding the new one).

    In Git 2.35, --histogram got a nice performance boost, which should make it faster in many cases. The details are too complicated to include in full here, but you can check out the reference below and see all of the improvements and juicy performance numbers.


  • If you’re a fan of performance improvements (and diff options!), here’s another one you might like. You may have heard of git diff‘s --color-moved option (if you haven’t, we talked about it back in our Highlights from Git 2.17). You may not have heard of the related --color-moved-ws, which controls how whitespace is or isn’t ignored when colorizing diffs. You can think of it like the other space-ignoring options (like --ignore-space-at-eol, --ignore-space-change, or --ignore-all-space), but specifically for when you’re running diff in the --color-moved mode.

    Like the above, Git 2.35 also includes a variety of performance improvement for --color-moved-ws. If you haven’t tried --color-moved yet, give it a try! If you already use it in your workflow, it should get faster just by upgrading to Git 2.35.


  • Way back in our Highlights from Git 2.19, we talked about how a new feature in git grep allowed the git jump addon to populate your editor with the exact locations of git grep matches.

    In case you aren’t familiar with git jump, here’s a quick refresher. git jump populates Vim’s quickfix list with the locations of merge conflicts, grep matches, or diff hunks (by running git jump merge, git jump grep, or git jump diff, respectively).

    In Git 2.35, git jump merge learned how to narrow the set of merge conflicts using a pathspec. So if you’re working on resolving a big merge conflict, but you only want to work on a specific section, you can run:

    $ git jump merge -- foo

    to only focus on conflicts in the foo directory. Alternatively, if you want to skip conflicts in a certain directory, you can use the special negative pathspec like so:

    # Skip any conflicts in the Documentation directory for now.
    $ git jump merge -- ':^Documentation'


  • You might have heard of Git’s “clean” and “smudge” filters, which allow users to specify how to “clean” files when staging, or “smudge” them when populating the working copy. Git LFS makes extensive use of these filters to represent large files with stand-in “pointers.” Large files are converted to pointers when staging with the clean filter, and then back to large files when populating the working copy with the smudge filter.

    Git has historically used the size_t and unsigned long types relatively interchangeably. This is understandable, since Git was originally written on Linux where these two types have the same width (and therefore, the same representable range of values).

    But on Windows, which uses the LLP64 data model, the unsigned long type is only 4 bytes wide, whereas size_t is 8 bytes wide. Because the clean and smudge filters had previously used unsigned long, this meant that they were unable to process files larger than 4GB in size on platforms conforming to LLP64.

    The effort to standardize on the correct size_t type to represent object length continues in Git 2.35, which makes it possible for filters to handle files larger than 4GB, even on LLP64 platforms like Windows1.


  • If you haven’t used Git in a patch-based workflow where patches are emailed back and forth, you may be unaware of the git am command, which extracts patches from a mailbox and applies them to your repository.

    Previously, if you tried to git am an email which did not contain a patch, you would get dropped into a state like this:

    $ git am /path/to/mailbox
    Applying: [...]
    Patch is empty.
    When you have resolved this problem, run "git am --continue".
    If you prefer to skip this patch, run "git am --skip" instead.
    To restore the original branch and stop patching, run "git am --abort".

    This can often happen when you save the entire contents of a patch series, including its cover letter (the customary first email in a series, which contains a description of the patches to come but does not itself contain a patch) and try to apply it.

    In Git 2.35, you can specify how git am will behave should it encounter an empty commit with --empty=<stop|drop|keep>. These options instruct am to either halt applying patches entirely, drop any empty patches, or apply them as-is (creating an empty commit, but retaining the log message). If you forgot to specify an --empty behavior but tried to apply an empty patch, you can run git am --allow-empty to apply the current patch as-is and continue.


  • Returning readers may remember our discussion of the sparse index, a Git features that improves performance in repositories that use sparse-checkout. The aforementioned link describes the feature in detail, but the high-level gist is that it stores a compacted form of the index that grows along with the size of your checkout rather than the size of your repository.

    In 2.34, the sparse index was integrated into a handful of commands, including git status, git add, and git commit. In 2.35, command support for the sparse index grew to include integrations with git reset, git diff, git blame, git fetch, git pull, and a new mode of git ls-files.

    [source, source, source]

  • Speaking of sparse-checkout, the git sparse-checkout builtin has deprecated the git sparse-checkout init subcommand in favor of using git sparse-checkout set. All of the options that were previously available in the init subcommand are still available in the set subcommand. For example, you can enable cone-mode sparse-checkout and include the directory foo with this command:

    $ git sparse-checkout set --cone foo


  • Git stores references (such as branches and tags) in your repository in one of two ways: either “loose” as a file inside of .git/refs (like .git/refs/heads/main) or “packed” as an entry inside of the file at .git/packed_refs.

    But for repositories with truly gigantic numbers of references, it can be inefficient to store them all together in a single file. The reftable proposal outlines the alternative way that JGit stores references in a block-oriented fashion. JGit has been using reftable for many years, but Git has not had its own implementation.

    Reftable promises to improve reading and writing performance for repositories with a large number of references. Work has been underway for quite some time to bring an implementation of reftable to Git, and Git 2.35 comes with an initial import of the reftable backend. This new backend isn’t yet integrated with the refs, so you can’t start using reftable just yet, but we’ll keep you posted about any new developments in the future.


The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.35, or any previous version in the Git repository.

  1. Note that these patches shipped to Git for Windows via its 2.34 release, so technically this is old news! But we’ll still mention it anyway. 

Highlights from Git 2.34

Post Syndicated from Taylor Blau original https://github.blog/2021-11-15-highlights-from-git-2-34/

The open source Git project just released Git 2.34 with features and bug fixes from over 109 contributors, 29 of them new. We last caught up with you on the latest in Git back when 2.33 was released. To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

Sparse index

In the past, we’ve talked about new Git features to make it possible to work with large repositories, like partial clones and sparse-checkout. For a complete description, check out the linked blog posts. But as a refresher, these two features work together to allow you to:

  • Fetch or clone only part of a repository’s objects, and
  • Only populate part of your working copy, typically scoped to a set of

This pair of features is designed to create the illusion that you are working in a much smaller repository than you actually are. For instance, if your work takes place in an all-encompassing monorepo, your local copy only needs to contain the parts of the repository that you frequently work in.

But often, this illusion falls short. Why? The answer is the index. The index is the data structure Git uses to track what will be written the next time you run git commit, as well as to track the state of every file in your repository at the current point in history.

As you can imagine, even if you are working in a small corner of a large repository, the index still has to keep track of the repository’s entire contents, not just the parts that you are working in. Unfortunately, that overhead adds up: every time Git needs work with the index, it needs to parse and write out a lot of data that doesn’t affect the parts of your repository outside of your sparse checkout.

That’s changing in this release with the addition of a sparse-enabled index. Unlike the index of previous versions, this release enables the index to only track the parts of your repository that you care about. Specifically, it only contains entries for parts of your repository that are either in your sparse checkout, or at the boundary between your sparse checkout and the rest of the repository.

Collapsing to a sparse index

Triangles represent trees and boxes represent blobs. Left: a representation of a non-sparse index’s contents. Right: a sparse-ified index.

The high-level details here are that the index format now understands that specially marked directories indicate the boundary between the contents of your sparse checkout and the parts of your repository that you don’t have checked out. But the process of implementing this new format, teaching sub-commands how to use it, and making sure that the sparse index can be expanded to a full index is much more detailed.

For all of the details behind this exciting new feature, check out a comprehensive blog post published by Derrick Stolee last week: Making your monorepo feel small with Git’s sparse index.

[source, source, source, source, source, source, source, source]

Multi-pack reachability bitmaps

In a previous blog post, we talked about a new feature to enable reachability bitmaps to keep track of objects stored in multiple packs within your object store.

This release of Git contains the remaining components described in that blog post. If you haven’t read it, here’s a summary. When serving a fetch, a Git server needs to send the client everything reachable from the set of objects they want, less anything reachable from the set that they already have. (You can think of a clone as a “special case” fetch where the client wants everything and has nothing).

In order to compute this set efficiently, Git can use reachability bitmaps. One of these .bitmap files stores a set of bitmaps, each corresponding to some commit. The contents of an individual bitmap is a string of bits, one per object, indicating which objects are reachable from each commit.

In the past, the contents of a reachability bitmap were tied to the order of objects within a single packfile. This meant that a bitmap could only cover objects in one packfile. In other words, bitmaps were only useful if you could efficiently pack the entire contents of your repository down into a single packfile.

For many repositories, writing all objects into the same pack is completely feasible. But the effort it takes to write a pack (including searching for deltas between objects, compressing individual objects, and I/O cost) scales with the size of the pack you’re writing.

Git 2.34 introduces a new bitmap format that is instead tied to the contents of the multi-pack index file. This means that a bitmap can now flexibly represent objects in multiple packs, and server operators no longer need to repack their biggest repositories into a single pack in order to take full advantage of reachability bitmaps.

For more details, including some of the steps required to make this new feature work, see the aforementioned blog post.

[source, source, source]

A new default merge strategy

In an earlier blog post, we explained Git’s newest merge strategy: ort. Here are some of the basics:

When Git needs to merge two branches, it uses one of several “strategy” backends in order to resolve the changes or emit conflicts when two changes cannot be reconciled.

For years, Git has used a strategy called “recursive”. If you have ever done a merge in Git without passing -s <strategy>, then you have almost certainly used the recursive engine. Recursive behaves mostly like a standard three-way merge, with one exception. In the case of “criss-cross” merges (where there isn’t a single merge base), recursive merges multiple bases together in pairs (recursively) in order to produce a single tree which is then treated as the new merge base. This makes it possible to resolve cases where a traditional three-way merge might produce a conflict.

In recent versions of Git, there has been an ongoing effort to replace the recursive strategy with a new one called ort (short for “ostensibly recursive‘s twin”). Why do this? There are a few reasons, but perhaps the most compelling is that a rewrite allowed Git to implement a merge strategy that doesn’t operate on the index (that same one we talked about a couple of sections ago)!

ort does just that: it’s a full-blown rewrite of the merge strategy that aims to emulate the same concepts behind recursive while avoiding many of its long-standing performance and correctness problems. In a merge containing many renames, ort outperforms recursive by 500x. For a series of similar merges (like in a rebase operation), the speedup is over 9000x, in part due to ort’s ability to cache and reuse results from previous merges.

These numbers show off some of the worst-case scenarios for recursive, but in testing, ort consistently outperforms recursive with much less variance. In Git 2.34, ort is now the default merge strategy, so you should notice faster merges with fewer bugs just by upgrading.

For more details about the ort merge strategy, see our earlier blog post, or any one of a six-part series of posts written by ort‘s creator, Elijah Newren: part one, part two, part three, part four, part five, and part six.



Now that we have looked at some of the bigger features in detail, let’s turn to a handful of smaller topics from this release.

  • You might be aware that Git allows you to sign your work by attaching your PGP signature to certain objects. For example, the Git project itself publishes tags signed by the maintainer in order to verify that each release comes from someone trustworthy.

    But the experience of using GPG and maintaining keys can be somewhat
    cumbersome. One alternative is to use a new feature of OpenSSH (released
    back in OpenSSH 8.0) that allows using the SSH key you likely already have as a signing key.

    Git 2.34 includes support to take advantage of this feature and allows you to sign your work using SSH keys. To try it out, you can either set user.signingKey to the SSH key you want to use (for example, by asking your ssh-agent for a list with ssh-add -L), or set gpg.format to ssh and gpg.ssh.defaultKeyCommand to ssh-add -L in order to automatically use the first SSH key available.

    After configuring Git to sign objects using your SSH keys, you can use git
    commit -S
    , git merge -S, and git tag -s as usual, and they will automatically use your SSH key.

    For more information about the new configuration options, including information about how to verify SSH signatures with an “allowed signers” file, check out the documentation.

    [source, source, source]

  • If you’ve ever accidentally typed git psuh when you meant push, you
    might have seen this message:

    $ git psuh
    git: 'psuh' is not a git command. See 'git --help'.
    The most similar command is

    You have always been able to control this behavior by setting the
    help.autoCorrect configuration. You can hide this advice by setting that
    configuration to never, or let Git automatically rerun the most similar
    command for you immediately or with a delay (by setting immediate, or a
    real number of seconds to wait before rerunning your command).

    In Git 2.34, you can now configure Git to ask you interactively whether you
    want to rerun your last operation with the suggested command by setting
    help.autoCorrect to prompt.


In Git 2.34, a handful of patch series were focused on improving the performance of interacting with other repositories. Here’s a pair of tidbits that improves the performance of git fetch and git push:

  • When fetching from a remote, your client needs to do some bookkeeping before
    and after it receives a set of objects from the remote.

    Before anything happens, your client needs to figure out what it has in common with the remote it’s fetching from, and what commits it wants as a result. Previously, this process was somewhat wasteful: Git used to load commit objects directly when they could instead have been read from the commit-graph, resulting in much improved performance. In Git 2.34, commits loaded in this code path use the commit-graph when possible. The effect of this scales with the number of references in your repository: in an example repository with over 2 million references, it cuts the time it takes to fetch a single commit by more than half.


  • Another patch series made a handful of improvements to updating local references when fetching, along with some changes to improve fetch negotiation, as well as skipping the connectivity check (which I’ll talk about in more detail in the next tidbit) when the receiving end had already verified the connectedness of the new objects. These changes together contributed similarly impressive performance improvements to the git fetch command.


You might have heard of “submodules,” the Git feature that allows combining multiple repositories by storing links to other repositories. Submodules have been somewhat neglected over the years, but this release brought renewed attention to the feature. Here are just some of the changes that enhance submodules:

  • It might be a surprise to learn that, though the majority of Git is written in C, the original git submodule command is actually a shell script!

    The Git project has been converting many of its subcommands written in other languages into C. Reimplementing subcommands as C programs means that
    they can be read and written more easily, take advantage of Git’s comprehensive libraries, and avoid the overhead of spawning many processes, especially on platforms where the new process overhead is rather costly.

    In Git 2.34, many parts of the git submodule command were rewritten in C.
    This project was completed by Atharva Raykar, who is a Google Summer of Code
    student. You can check out their final report here, along with Git’s other GSoC participant ZheNing Hu’s report here.

    [source, source, source]

  • While we’re on the topic of submodules, one thing you might not know is that
    when using commands that deal with objects from both the submodule and the
    repository containing it, the submodule is temporarily added as an alternate
    object store of the other repository!

    Alternates are Git’s object borrowing mechanism, which allow you to in effect link multiple object stores together. When using a repository with alternates, any object lookups that fail to find an object are retried in that repository’s alternate.

    In order to make both the objects in a submodule and the objects in the repository that contains that submodule available to git grep (among a select set of
    other commands), the submodule would temporarily be added as an alternate for the duration of that command.

    If you’re thinking to yourself, “this is a hack”, then you’re not alone. Git has made internal changes to parameterize many functions in terms of a repository (which is usually the global the_repository). This allowed Git to avoid combining multiple repositories via alternates and instead make function calls by passing two (or more) separate repository instances. This enables Git to avoid hackily relying on the alternates mechanism, which produced less confusing and error-prone code as a result.

    [source, source, source]

  • One last submodule-related topic (though there are more we couldn’t fit here!). If you are cloning a repository that you know to contain submodules, it is often useful to pass the --recurse-submodules, which will cause that repository’s submodules to be cloned and initialized, too.

    But other commands that can optionally recurse into submodules (like git diff, for example) don’t themselves recurse into submodules by default, even when you cloned with --recurse-submodules. In Git 2.34, this is no longer the case, with one caveat: when cloning with --recurse-submodules, other commands only recurse into submodules if the submodule.stickyRecursiveClone configuration is set, to prevent commands from unintentionally running in submodules.


Now that I’ve listed out a few of the submodule-related changes, let’s get back
to the rest of the tidbits:

  • If you’ve ever scripted around Git, you have almost certainly run into Git’s cat-file plumbing command. This tool can be used to print out a single object (by providing the object name as an argument), a stream of objects (by providing line-delimited object names over stdin), or all objects in your repository (with --batch-all-objects).

    This low-level command accidentally took into account replace refs, which produced confusing results when combined with --batch-all-objects, resulting in it not actually showing all objects in your repository if some were hidden by refs/replace.

    Dropping support for replacement refs made it possible for cat-file to reuse some information when it is given --batch-all-objects. Namely, to populate the list of objects, it iterates each object in each pack and therefore knows the byte offset within each pack where each object can be found. Previous versions of Git did not reuse this information when looking up objects to parse them, but Git 2.34 retains this information.

    This makes it possible to process an object’s metadata much more quickly by avoiding having to locate it twice. In a copy of torvalds/linux, the time it takes to print the name and type of each object (for the curious, that’s git cat-file --batch-check='%(objectname) %(objecttype)' --batch-all-objects --unordered) dropped from 8.1 seconds to just 4.3 seconds.


  • There has been a recent concerted effort to remove some memory leaks from Git’s code. Unlike library code, Git typically has a very short runtime. This makes the need to free allocated memory much less urgent, since if a process is about to exit, all memory allocated to it will be “freed” by the operating system.

    A recent patch has made it so that Git’s integration tests can be run in a mode that ensures no memory is leaked (by setting GIT_TEST_PASSING_SANITIZE_LEAK=true in the environment). Since Git’s test suite still contains memory leaks in some tests, a new mode was added to run only tests that have been specifically marked as being leak-free. That way, when Git is compiled with leak detection (by running make SANITIZE=leak), you can easily spot regressions in tests that were supposedly leak-free.

    Building off this new infrastructure, there have been many patch series that remove leaks from the code in various places.

    [source, source, source, source, source, source, source, source, source, source, source]

  • When you need to get some debugging information out of a Git process, like what version you’re running, or how much time it spent in a particular region, the trace2 mechanism is a good choice. Often, looking at these logs is like looking at a piece of a puzzle. For example, when you run git fetch, you actually run git fetch-pack, which then invokes git upload-pack on the remote, which itself invokes git pack-objects.

    Trace2 output includes information about when child processes are started and stopped (and consequently, how long they took to run), but what if you’re trying to figure out something more basic than that, like what process you were started by? In other words, if you’re stuck looking at output from a slow git pack-objects, how do you figure out whether it was a fetch (in which case it would have been started by upload-pack) or part of a repository repack (which here would be started by git repack)?

    Git 2.34 includes additional debugging information in trace2 output to indicate the full ancestry of a process, so you can easily read out the name of the program a process was started by, like so:

    $ cat trace2.log
    21:14:38.170730 common-main.c:48                  version 2.34.0.rc1.14.g88d915a634
    21:14:38.170810 common-main.c:49                  start /home/ttaylorr/src/git/git pack-objects git pack-objects --revs --thin --stdout --progress --delta-base-offset
    21:14:38.174325 compat/linux/procinfo.c:170       cmd_ancestry sh <- git-upload-pack <- sh <- git <- zsh <- sshd <- systemd

    (Above, you can see that pack-objects was run by git upload-pack, which was run by sh–that’s where we inserted the trace point via uploadpack.packObjectsHook, which was run by git, in my shell, over sshd, which was started by systemd.)

    [source, source]

  • In a previous post, we talked about the background maintenance daemon, which can be used to perform routine repository maintenance in the background (like pre-fetching, or repacking the objects in your repository).

    When this feature was first released back in Git 2.31, it had support for cron on Linux, launchctl on macOS, and schtasks on Windows. Git 2.34 brings support for systemd-based timers on Linux. This has a few benefits over cron: cron may not be available everywhere, and using systemd isolates each service into its own cgroup and writes its logs separately.

    If you want to use systemd instead of the default scheduler, you can run:

    $ git maintenance start --scheduler=systemd


  • In a previous blog post, we talked about how git rebase works, and how to move a complicated branching structure elsewhere in your repository’s history.

    The brief history is that this used to be done with the --preserve-merges option, which attempted to replay merges elsewhere in history. Confusingly, this mode uses rebase’s interactive machinery internally, so attempting to manually edit the rebase sequence (with git rebase -i) often produced counterintuitive results.

    The --rebase-merges option fixed many of these issues and has been the recommended replacement of --preserve-merges for some time now. In Git 2.34, the --preserve-merges option is now gone for good.


  • You might have used git grep to quickly search through your code. But you might not have known that git log has a --grep=<expression> option, which allows you to filter through commits produced by git log to only show ones whose commit messages match the provided expression.

    In previous versions, the --grep option only filtered down which results were presented in the output of git log. But in Git 2.34, git log now knows how to colorize the parts of its output that matched the provided expression, like so:


  • Last but not least, if you’re using Git in a terminal on Windows, you might have noticed that your terminal is left in a weird state after running git commit, or git rebase, like in this issue.

    This was because Git shares its terminal with any child processes it spawns, including your $EDITOR. If your editor sets special terminal settings but does not clear them upon exiting, it can leave your terminal in a broken state.

    Git 2.34 introduces functionality to save and restore the terminal settings before and after launching your editor. That means that even misbehaving editors cannot corrupt your terminal since it will always be restored to the state it was in before launching the editor.


The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.34, or any previous version in the Git repository.

Make your monorepo feel small with Git’s sparse index

Post Syndicated from Derrick Stolee original https://github.blog/2021-11-10-make-your-monorepo-feel-small-with-gits-sparse-index/

One way that Git scales to the largest monorepos is the sparse-checkout feature, which allows you to focus on a subset of the files. This is supposed to make it feel like you are actually in a small repository, even though you are contributing to a large repository.

There’s only one problem: the Git index is still large in a monorepo, and users can feel it. Until now.

The Git index is a critical data structure in Git. It serves as the “staging area” between the files you have on your filesystem and your commit history. When you run git add, the files from your working directory are hashed and stored as objects in the index, leading them to be “staged changes”. When you run git commit, the staged changes as stored in the index are used to create that new commit. When you run git checkout, Git takes the data from a commit and writes it to the working directory and the index.

The working directory, index, and commit history

In addition to storing your staged changes, the index also stores filesystem information about your working directory. This helps Git report changed files more quickly. One problem is that the index stores this information for every file at HEAD, even if those files are outside of the sparse-checkout definition. This means that the index can be much larger in a monorepo than it would be if your important subset of files was in its own repository.

Throughout the past year, the Git Fundamentals team at GitHub contributed a new feature to Git called the sparse index, which allows the index to focus on the files within the sparse-checkout cone. If you are in a repository that can use sparse-checkout, then you can enable the sparse index using these commands:

git sparse-checkout init --cone --sparse-index
git sparse-checkout set <dir1> <dir2> ... <dirN>

The size of the sparse index will scale with the number of files within your chosen directories, instead of the full size of your repository. When enabled with a number of other performance features, this can have a dramatic performance impact.

As we built the sparse index, we tested its performance against a large monorepo that has over two million files at HEAD, and with a sparse-checkout definition that populated the working directory with about 100,000 of those files. We then compared the performance between having a normal “full” index, a sparse index, and a repository that only contained the files matching the sparse-checkout definition.

Command performance by index and repository type

The chart above demonstrates the significant performance improvements enabled by the sparse index. The bottom bars for each Git command show the runtime without the sparse index. The middle bars show the runtime of the same commands with the sparse index enabled. The top bars show the runtime of these commands, except in a repository that only contained the files within the sparse-checkout cone, representing the theoretical optimum. Since the sparse index still contains pointers to the rest of the monorepo, there is still some overhead. This overhead is hardly noticeable, since the difference is at most 60 milliseconds, even in the worst case above.

Today, I will go deep into the design and implementation of the sparse index. In particular, I’ll focus on how the Git community made such a significant change to a critical data structure in a safe way. I will include links to the actual changes in the Git codebase as I go.

This post is going to go deep into the guts of Git. If you are unfamiliar with Git’s object model, then learn about commits, trees, and blobs before continuing.

You can also get an overview of the sparse index alongside several other advanced Git features in this presentation I gave with colleague Lessley Dennington at the GitHub Nova event.

First, let’s dig into the Git index to understand its structure and purpose. I’ll use the derrickstolee/trace2-flamechart repository as a concrete example (and to generate some of the figures). If you want to follow along with the Git commands shown, then clone that repository.

The Git index

The index file stores a list of every file at HEAD, along with the object ID for its blob and some metadata. This list of files is stored as a flat list and Git parses the index into an array.

You can expose the list of files in the index using the git ls-files command:

$ git ls-files

In this repository, many of the files live in an examples/ directory, but you don’t actually need them for the functionality of the code, which lives in the bin/ directory. You can focus the repository only on the necessary files using the git sparse-checkout command:

$ ls
LICENSE     README.md     bin       examples   package.json
$ git sparse-checkout init --cone --sparse-index
$ git sparse-checkout set bin
$ ls
LICENSE     README.md     bin       package.json

Even though I used the sparse-checkout command to reduce the size of the working directory, my git ls-files command will return the same set of files as before. In fact, I can dig in a little more and expose some more information using git ls-files --debug.

$ git ls-files --debug
  ctime: 1634910503:287405820
  mtime: 1634910503:287405820
  dev: 16777220 ino: 119325319
  uid: 501  gid: 20
  size: 1098    flags: 0
  ctime: 1634910503:288090279
  mtime: 1634910503:288090279
  dev: 16777220 ino: 119325320
  uid: 501  gid: 20
  size: 934 flags: 0
  ctime: 1634910767:828434033
  mtime: 1634910767:828434033
  dev: 16777220 ino: 119325520
  uid: 501  gid: 20
  size: 7292    flags: 0
  ctime: 0:0
  mtime: 0:0
  dev: 0    ino: 0
  uid: 0    gid: 0
  size: 0   flags: 40004000

The above output is truncated, but it shows that each index entry contains additional filesystem information for each path. The last entry listed shows what happens for a file that is outside of the sparse-checkout definition: all of the filesystem information is removed and the flags entry has some bits enabled. These bits include a SKIP_WORKTREE bit that signifies that Git will not write that file to the working directory.

If these files are not written to disk, then why are they listed in the index at all? The reason is that Git still needs to understand the content that would be there if the index was expanded. Further, that information is used to generate a commit with the git commit command.

In the Git codebase, there is a test helper that can show additional information from the index: test-tool read-cache. (You won’t have this command if you just have normal Git installed.) Running it here, you can see that the index also stores the object IDs for every file:

$ test-tool read-cache --table
100644 blob 646521d0d6c070e6f15e0f5828be1127d3b75503    LICENSE
100644 blob b230b3a6e2d81d50dc00177e970a10726b5baf08    README.md
100755 blob 918533d51c7a5f91622311893dcfd40bfd4f43d7    bin/index.js
100644 blob e0f88531b916b92821476760672e8161b9954898    examples/fetch/git-fetch-after.png
100644 blob f4a523cd1acb0a9d2620970ad7a43405d6e305dc    examples/fetch/git-fetch-after.svg
100644 blob fc4e30dca5fcb0c3d2031dc82a43d5d644e26b41    examples/fetch/git-fetch-after.txt
100644 blob 15dc889965617df3b5a30cf01e52c491e41c59c1    examples/fetch/git-fetch-before.png
100644 blob 602bd5bcbd815914a035d0d4f0d2a3896f600de2    examples/fetch/git-fetch-before.svg
100644 blob bc40a8e4658d17c35de996f3655e737b85ce7ad9    examples/fetch/git-fetch-before.txt
100644 blob 356cdd36e0d78a62af8b010d25d658054bb6fdc7    examples/fetch/git-fetch-combined.png
100644 blob cc0c23f2c8a822c51a17c46268f38c2268b400ae    examples/fetch/git-fetch-combined.svg
100644 blob dfc0893d172d841d971e206461466db935b7c192    examples/maintenance/trace.png
100644 blob a10a876472e46c6ae58e6fc6e2adc64d4dae809b    examples/maintenance/trace.svg
100644 blob 8f5e8bfbc44674feb3aa96e0b7bf1bf717495658    examples/maintenance/trace.txt
100644 blob a4599a9e0a01c28a2c0a622457664fc8c55bfdf9    package.json

Notice in particular how every single row lists the object type as blob, meaning a file. Also, the bin/index.js file has executable permissions, so the file permission column shows 100755 instead of 100644 like the others. This is all important metadata to store for each index entry.

To visualize the index, the diagram below displays our blobs as boxes in a line in the order given above, but it represents the trees that connect the root tree to those blobs as triangles. Thus, the root tree has two subtrees for bin/ and examples/, and the examples/ subtree has two subtrees for fetch/ and maintenance/.

full index

This figure represents all of the links between the trees and blobs. However, the core index data structure stores only the list of blobs as a flat list. The nesting tree structure does not exist in the core of the file.

However, there is an extension to the format that includes the information of the nesting directories: the cache-tree extension. Each node of the cache-tree stores a list of sub-nodes and a range of index entries that are covered by the current node. Each node stores the object ID for the tree it represents.

The index and the cache tree

The root node always covers all of the blobs in the index. The contained nodes have ranges contained within that range.

Git commands such as git add update the cache-tree extension in order to make the next git commit command very fast. To create the new commit, Git can use the tree from the root of the cache-tree extension.

Many Git commands use the index in many different ways. Some commands compare the working directory to the index and update one or the other when there is a difference. Others compare the index and a commit. Some compare multiple indexes together. Some use the cache-tree extension to navigate the nested tree structure, but mostly they scan the flat list of files in the form of an array.

The index affects Git performance at scale

The index can be very large in monorepos. I will show Git performance data from an example monorepo that has over two million files at HEAD. Even using the latest compression techniques available, the index file is over 180 MB in this monorepo.

This has a significant effect on normal Git commands. Presented below is an annotated flamechart of a git status command with one of these large indexes. The x-axis represents time since the start of the command, and each rectangle represents a region of Git’s execution that is marked by its trace2 logging library.

`git status` with full index

Three regions are annotated here:

  1. The index is read from disk and parsed into memory.
  2. The working directory is compared to the index. This triggers a lazy initialization of some hash tables that are required for this effort.
  3. The modified index is written to disk.

Parsing is multi-threaded, but writes are not. This explains some of the differences in how long those actions take.

Clearly, the amount of data in the index is a significant portion of this command. This also affects other commands such as git add and git commit, which are expected to be fast.

This performance concern became abundantly clear when our monorepo customer wanted to group more dependencies within the monorepo. Some teams had isolated Git repositories that were hundreds of times smaller than the monorepo, but these repositories created packages that were consumed by the monorepo, causing complications in tracking dependencies. The hope was that they could merge into the monorepo and rely on sparse-checkout to make it feel like they were still working in a small repository. The user experience was actually much worse, and the root cause was the time it took to read and write the index.

The main culprit is that there are millions of index entries corresponding to files these users do not care about for their daily work. When they push to the server and create a pull request, the build machines can handle the massive scale of building the entire tree and verifying that the small change works within the larger whole of the monorepo. Users should not need to pay that cost.

As members of the Git Fundamentals team, we are very focused on Git performance, and the index has been on our minds for years. For example, the index is a bottleneck for the VFS for Git environment, but that environment has particular needs that prevent improvements in this area.

The biggest thing that has changed recently is the creation of “cone mode” sparse-checkout patterns. These use directory-based pattern matching instead of file-based pattern matching. While cone mode was originally designed to speed up pattern matching in the sparse-checkout feature, it has now unlocked a new way to shrink the index.

The sparse index

The sparse index differs from a normal “full” index in one aspect: it can store directory paths with the object ID for its tree object. This is in addition to the file paths which are paired with blob objects. Since the cone mode sparse-checkout patterns match on a directory level, we can determine that an entire directory is out of the sparse-checkout cone and replace all of its contained file paths with a single directory path.

Back in my example derrickstolee/trace2-flamegraph repository, you can enable the sparse index command and then use the test-tool read-cache tool to show the contents of the index.

$ git sparse-checkout init --cone --sparse-index
$ test-tool read-cache --table
100644 blob 646521d0d6c070e6f15e0f5828be1127d3b75503    LICENSE
100644 blob b230b3a6e2d81d50dc00177e970a10726b5baf08    README.md
100755 blob 918533d51c7a5f91622311893dcfd40bfd4f43d7    bin/index.js
040000 tree b395192a7adbf21793f9489f3623c117802b2043    examples/
100644 blob a4599a9e0a01c28a2c0a622457664fc8c55bfdf9    package.json

Similar to my previous visualizations, you can now see how the index contains a directory entry in addition to the four blobs.

sparse index

The sparse directory entries correspond to directories that are just outside of the sparse-checkout definition. These directories also have a cache-tree node whose range is only one entry: that sparse directory entry.

I can even display the full details of the --debug output for git ls-files. This currently requires a --sparse flag that I have implemented in my personal fork of Git, but a similar feature will eventually be available in the core Git client.

$ git ls-files --debug --sparse
  ctime: 1634910503:287405820
  mtime: 1634910503:287405820
  dev: 16777220 ino: 119325319
  uid: 501  gid: 20
  size: 1098    flags: 200000
  ctime: 1634910503:288090279
  mtime: 1634910503:288090279
  dev: 16777220 ino: 119325320
  uid: 501  gid: 20
  size: 934 flags: 200000
  ctime: 1634910767:828434033
  mtime: 1634910767:828434033
  dev: 16777220 ino: 119325520
  uid: 501  gid: 20
  size: 7292    flags: 200000
  ctime: 0:0
  mtime: 0:0
  dev: 0    ino: 0
  uid: 0    gid: 0
  size: 0   flags: 40004000
  ctime: 1634910503:288676330
  mtime: 1634910503:288676330
  dev: 16777220 ino: 119325321
  uid: 501  gid: 20
  size: 680 flags: 200000

This output is not truncated as it was before, and you can see that the sparse directory entry for examples/ is the only one with blank filesystem data. It also has the same flags value as the sparse file entries did before.

By removing the number of index entries as well as reducing the average path length, you can shrink the index size significantly. In our example monorepo, most users will reduce their index size from 180 MB to less than 10 MB!

Back to our monorepo, let’s try that git status example again and create a new flamechart. Here, I compare the flamechart for git status with a full index followed by one with a sparse index.

Annotated `git status` flame chart

With the sparse index, the git status command drops from 1.3 seconds to under 200 milliseconds! In the flame chart above I highlighted some regions that have similar appearance in each run. These represent the parts of the git status command that are actually walking the working directory and doing work independent of the index size. Everything else is slower in the full index case entirely because of the size of the index!

Building the sparse index safely

Pruning the index at the directory level is a relatively simple idea with a rather complicated result: our flat list of paths now contains two types of Git objects! There are dozens of places in the Git codebase that interact directly with the index in subtly different ways, and all of them are expecting every index entry to point to a blob object.

In order to make such a change to a critical data structure, we needed to first create a compatibility layer. To safely interact with a sparse index, we needed a way to expand a sparse index to an equivalent full index. This way, code paths that have not been integrated and tested with a sparse index can still be used, even if the on-disk format is sparse.

In the Git codebase, we started by creating the ensure_full_index() method, which converts a sparse index into a full one. This method inspects the list for directory entries and replaces them with its contained file entries. Since the directory is outside of the sparse-checkout cone, we could ignore all of the filesystem metadata information and populate the list by traversing the tree objects under that directory. Before anything else happens, the ensure_full_index() method is called immediately after parsing the index so no interactions with the index happen until the sparse directories are removed.

Expanding to a full index

When Git expands a sparse index to a full one, it scans the entries in lexicographic order. If the entry is a file, then Git copies it to the new list. If the entry is a directory, then the tree at that location is passed to the read_tree_at() method to iterate over all contained blobs. For each contained blob, Git generates an index entry for the corresponding file. Finally, the index entry list is copied back into the index and the index is no longer sparse.

Once that protection was in place, we extended Git to write the sparse index format. When writing the index, a full index is converted to a sparse one in-memory using the convert_to_sparse() method.

Collapsing to a sparse index

To convert from a full index to a sparse one, Git uses the cache-tree extension to find the object ID for our new sparse directory entries. The existing file entries are copied, and Git inserts the directory entries as needed.

Once these steps were built, we could verify that the index size was shrinking to the scale we expected. Both of these steps were included in a single series that introduced the format and implemented the conversions.

While it is nice that the index size has shrunk, we couldn’t stop there. The index is a very compact data structure, so it is more efficient to read it from disk than to recreate it by parsing trees. The ensure_full_index() method takes noticeably longer to expand a sparse index to a full one than it would take to read a full one from disk. In order to gain the performance benefits of a sparse index, we needed to teach Git what to do when encountering sparse directory entries.

Before embarking on these integrations, we first set up more guardrails. A new setting, command_requires_full_index, was created which is enabled by default. This setting is used to trigger ensure_full_index() upon parsing a sparse index unless the Git command being run has explicitly disabled the setting. This allowed us to integrate with Git commands one-by-one without disrupting the behavior of other Git commands. In addition to these settings, we inserted calls to ensure_full_index() before most index interactions. to make sure that we were operating on a full index in any code path that might iterate over the index entries. This allowed us to find which code paths needed integration: we could debug a Git command with a breakpoint on ensure_full_index() and see the call stack that led to that expansion.

The first command to integrate with the sparse index was git status. In hindsight, this was a challenging command to use as a starting point, because it performs multiple index operations that are common to other commands. This became clear when integrating with git checkout and git commit because most of the work was already done in the git status integration.

Let’s explore some of the smaller interactions that needed special care with directory entries.

Example implementation detail: git diff

The git diff command can show what is different between different representations of a working directory. There are two interesting cases that involve the index: comparing the working directory to the index, and comparing the index to a commit.

With no other arguments, git diff shows the differences in the working directory compared to what is staged in the index. This algorithm is mostly simple to integrate with the sparse index: while walking the working directory, Git drills into a directory only if it exists. If the sparse directory entries in the sparse index do not appear as directories in the working directory, then it never tries to drill into the sparse directory entries. If Git finds that a sparse directory entry does exist in the filesystem as a real directory, then the ensure_full_index() method expands the index and Git continues as normal. This is not desired, so we did everything possible to make sure that these directories do not exist, including updating the sparse-checkout feature to delete ignored files outside the cone.

The git diff --cached command compares the files staged in the index with the commit at HEAD. Here, it is easier to have differences outside the sparse-checkout cone, such as when using git reset --soft to change the HEAD commit without changing the working directory or index. In this case, the git diff --cached command wants to compare the root tree for the HEAD commit to the files in the index. This can proceed normally for the files that exist in the sparse index, but when we reach a sparse directory entry, we see the tree object staged in the index as well as a tree object from the tree walk. At this point, we shift from a tree-vs-index comparison to a tree-vs-tree comparison of those subtrees. When that diff is complete, we can continue with the larger comparison.

One major benefit to the tree-vs-tree comparison is that it is easier to compare the same type of objects. The recursive comparison can also prune the walk when it finds two subtrees with the same object ID, as all of their content is the same at that point.

This change allows us to report on these differences not only in the git diff command, but also such diffs as are written during git status, git checkout, and git commit.

Implementation detail: ORT merge strategy

When beginning the sparse index work, there was a huge question that we did not know how to tackle: three-way merges. The default merge strategy, recursive, uses the index as a data structure during its computation. It was going to be difficult to reconcile that algorithm to work within the confines of a sparse index. In fact, many merges that a monorepo user runs would need to resolve merges outside of the sparse-checkout cone.

Luckily, another contributor, Elijah Newren, announced that he was creating a new merge strategy, named the ORT strategy, that did not use the index. We prioritized reviewing and testing that strategy so that we could take advantage of it. It turns out that it is also a better algorithm in general, so it will become the new default strategy with Git 2.34.

The critical feature of the ORT strategy was its replacement of the index with a recursive tree-like structure. That structure is built from the root tree and only creates subtrees for paths that have changed since the merge base. At the end, the ORT strategy creates an index to match its representation of the resulting merge commit.

Because of the ORT merge strategy, integrating the sparse index into git merge, git rebase, git cherry-pick, and git revert was very simple. We just needed to make sure the index that was created at the end of the merge was sparse from its original creation.

Different merge strategies and their performance

As we reported earlier, the ORT strategy improves over the recursive strategy in the typical case, but also the recursive strategy has significant outliers as shown in the box plot above. Enabling the sparse index on top of the ORT strategy provides even more improvements.

Without the ORT merge strategy, the sparse index work could have easily doubled in scope. For a detailed look at the ORT strategy and its many optimizations, take a look at Elijah’s six-part blog series:

Testing the sparse index

The sparse index was touching critical code and doing so in interesting ways. We needed a way to carefully test that these changes were as correct as possible. The Git test suite is substantial and has excellent coverage of most index operations. However, almost all of those tests do not use sparse-checkout, so we couldn’t immediately gain value in checking the sparse index by enabling it globally.

We created a test script that focused on testing the sparse index in a new way. The test starts by creating a repository with some interesting data shapes in it. Then, each test case starts by copying that repository into three new repositories. Those three repositories have different configurations:

  1. The repository as-is, without sparse-checkout.
  2. The repository with cone mode sparse-checkout enabled.
  3. The repository with cone mode sparse-checkout and sparse index enabled.

Then, each test case runs a number of Git commands against all three repositories, expecting the same output and results in the working directory. This allowed us to be confident that the changes we were making to enable the sparse index would have identical behavior with the other two cases.

Along the way, we found some interesting differences between sparse-checkout and full repositories. Several of these bugs have been fixed since. Sometimes, it was unclear whether the sparse-checkout feature should do the same thing as a normal repository, specifically when interacting with files outside of the sparse-checkout cone. This led to changing how some commands interact with sparse entries. In Git 2.34, some commands will need a --sparse flag in order to modify paths outside of the sparse-checkout cone.

In addition to these test scripts, we routinely ran the Scalar functional tests against our development branches, since many of those tests focus on special circumstances around the sparse-checkout feature. If a Scalar test would fail when the Git tests did not, then we would create a similar test in Git to prevent such a bug in the future.

Once we had integrated with a core set of Git commands, we also created an experimental release that contained early versions of these integrations. We provided this version to a subset of monorepo users to evaluate the performance. We found some interesting data from some of the users, but overall the results confirmed that the sparse index was going to significantly improve the user experience in the monorepo. The most important thing we discovered is that the sparse-checkout feature should remove ignored files outside of the cone, as those files cause the sparse index to expand to a full one, negating the performance benefits. There is an additional benefit in that the working directory shrinks even more by deleting these files.

The current state of the sparse index

Not all Git commands understand the sparse index. Those that have not been integrated trigger a compatibility check that converts a sparse index into a full one during the first index read. The integrated commands are ones that have been carefully tested with the sparse index. They likely received some code updates in order to properly handle a sparse index. These integrations were dispersed across the last few Git versions, and some only exist in the microsoft/git fork until we can complete contributing them to the core Git project.


In June, Git 2.32.0 was released with an understanding of the sparse index format. In August, Git 2.33.0 included integrations with git status, git commit, and git checkout. Git 2.34.0 is slated for a November release with integrations for git add, git merge, git rebase, git cherry-pick, and git reset. In order to serve our monorepo users, we fast-tracked some integrations into a pre-release in our fork including integrations with git diff, git blame, git clean, git sparse-checkout, and git stash.

Based on our understanding of how users interact with a monorepo, the commands that are listed here are sufficient to cover almost all users’ needs. We expect that users who adopt the sparse index with these integrations will have a significantly improved experience from before. This all depends on the size of their monorepo and on the size of their sparse-checkout cone.

We will release this feature widely to our monorepo users with the sparse index on by default in the 2.34 release of our Git fork. The core Git project will keep the sparse index off by default until it has all of these features and the implementation has been stable for a few versions.

Looking to the future

At this point, we have covered all of the integrations we need to have a successful monorepo experience. There is more work to be done. In particular, we need to finish contributing the final integrations in our list. Upstream progress takes time and we are grateful for all of the feedback we have received from the community so far. There are more commands that could use integrations, as well.

For now, we are focused on ensuring that monorepo users transition to the sparse index without incident. If you are interested in the sparse index, then we believe it is in an excellent state to start using it. If you have any issues with its performance or stability, then please do not hesitate to create an issue or start a discussion. We will be there to help!

Through the sparse index, we broke through a significant barrier to monorepo scale. We continue to seek the next innovation that will help Git scale to the largest repos out there.

Security keys are now supported for SSH Git operations

Post Syndicated from Kevin Jones original https://github.blog/2021-05-10-security-keys-supported-ssh-git-operations/

GitHub has been at the forefront of security key adoption for many years. We were an early adopter of Universal 2nd Factor (“U2F”) and were also one of the first sites to transition to Webauthn. We’re always on the lookout for new standards that both increase security and usability. Today we’re taking the next step by shipping support for security keys when using Git over SSH.

What are security keys and how do they work?

Security keys, such as the YubiKey, are portable and transferable between machines in a convenient form factor. Most security keys connect via USB, NFC, or Bluetooth. When used in a web browser with two-factor authentication enabled, security keys provide a strong, convenient, and phishing-proof alternative to one-time passwords provided by applications or SMS. Much of the data on the key is protected from external access and modification, ensuring the secrets cannot be taken from the security key. Security keys should be protected as a credential, so keep track of them and you can be confident that you have usable, strong authentication. As long as you retain access to the security key, you can be confident that it can’t be used by anyone else for any other purpose.

Screenshot of GitHub identity verification screen with security key option

Use your existing security key for Git operations

When used for SSH operations, security keys move the sensitive part of your SSH key from your computer to a secure external security key. SSH keys that are bound to security keys protect you from accidental private key exposure and malware. You perform a gesture, such as a tap on the security key, to indicate when you intend to use the security key to authenticate. This action provides the notion of “user presence.”

Security keys are not limited to a single application, so the same individual security key is available for both web and SSH authentication. You don’t need to acquire a separate security key for each use case. And unlike web authentication, two-factor authentication is not a requirement when using security keys to authenticate to Git. As always, we recommend using a strong password, enrolling in two-factor authentication, and setting up account recovery mechanisms. Conveniently, security keys themselves happen to be a great recovery option for securely retaining access to your two-factor-enabled account if you lose access to your phone and backup codes.

The same SSH keys you already know and love, just a little different

Generating and using security keys for SSH is quite similar to how you generated and used SSH keys in the past. You can password-protect your key and require a security key! According to our data, you likely either use an RSA or ed25519 key. Now you can use two additional key types: ecdsa-sk and ed25519-sk, where the “sk” suffix is short for “security key.”

$ ssh-keygen -t ecdsa-sk -C <email address> 
Generating public/private ecdsa-sk key pair. 
You may need to touch your authenticator to authorize key generation.

Once generated, you add these new keys to your account just like any other SSH key. You’ll still create a public and private key pair, but secret bits are generated and stored in the security key, with the public part stored on your machine like any other SSH public key. There is a private key file stored on your machine, but your private SSH key is a reference to the security key device itself. If your private key file on your computer is stolen, it would be useless without the security key. When using SSH with a security key, none of the sensitive information ever leaves the physical security key device. If you’re the only person with physical access to your security key, it’s safe to leave plugged in at all times.

Screenshot of SSH keys associated with GitHub user account

Safer Git access and key management

With security keys, you can achieve a higher level of account security and protection from account takeover. You can take things a step further by removing your previously registered SSH keys, using only SSH keys backed by security keys. Using only SSH keys backed by security keys gives you strong assurance that you are the only person pulling your Git data via SSH as long as you keep the security key safe like any other private key.

Security keys provide meaningful safety assurances even if you only access Git on trusted, consistent systems. At the other end of the spectrum, you might find yourself working in numerous unfamiliar environments where you need to perform Git operations. Security keys dramatically reduce the impact of inadvertent exposure without the need to manage each SSH key on your account carefully. You can confidently generate and leave SSH keys on any system for any length of time and not have to worry about removing access later. We’ll remove unused keys from your account, making key management even easier. Remember to periodically use keys you want to retain over time so we don’t delete them for you.

Protecting against unintended operations

Every remote Git operation will require an additional key tap to ensure that malware cannot initiate requests without your approval. You can still perform local operations, such as checkout, branch, and merge, without interruption. When you’re happy with your code or ready to receive updates, remote operations like push, fetch, and pull will require that you tap your security key before continuing. As always, SSH keys must be present and optionally unlocked with a password for all Git operations. Unlike password-protected SSH keys, clients do not cache security key taps for multiple operations.

$ git clone [email protected]:github/secure_headers.git
Cloning into 'secure_headers'...
Confirm user presence for key ECDSA-SK SHA256:xzg6NAJDyJB1ptnbRNy8UxD6mdm7J/YBdu2p5+fCUa0
User presence confirmed

Already familiar with using SSH keys backed by security keys? In that case, you might wonder why we require verification (via the security key “tap”) when you can configure your security key to allow operations to proceed as long as the security key is present. While we understand the appeal of removing the need for the taps, we determined our current approach to require presence and intention is the best balance between usability and security.

Towards a future with fewer passwords

Today, you can use a password, a personal access token (PAT), or an SSH key to access Git at GitHub. Later this year, as we continue to iterate toward more secure authentication patterns, passwords will no longer be supported for Git operations. We recognize that passwords are convenient, but they are a consistent source of account security challenges. We believe passwords represent the present and past, but not the future. We would rather invest in alternatives, like our Personal Access Tokens, by adding features such as fine-grained access and more control over expiration. It’s a long journey, but every effort to reduce the use of passwords has improved the security of the entire GitHub ecosystem.

By removing password support for Git, as we already successfully did for our API, we will raise the baseline security hygiene for every user and organization, and for the resulting software supply chain. By adding SSH security key support, we have provided a new, more secure, and easy-to-use way to strongly authenticate with Git while preventing unintended and potentially malicious access. If you are ready to make the switch, log in to your account and follow the instructions in our documentation to create a new key and add it to your account.

We wanted to extend our gratitude to Yubico, with whom we’ve partnered several times over the years, for being an early collaborator with us on this feature and providing us valuable feedback to ensure we continue to improve developer security.

Scaling monorepo maintenance

Post Syndicated from Taylor Blau original https://github.blog/2021-04-29-scaling-monorepo-maintenance/

At GitHub, we serve some of the largest Git repositories on the planet. We also serve some of the fastest-growing repositories. Each day, the largest repositories we host become even larger.

About a year ago, we noticed that the job we use to repack Git repositories began hitting our self-imposed timeouts on larger repositories. Even when expanding these timeouts, failing maintenance on these repositories has generally been the cause of degraded performance that is hard to mitigate.

Today, these problems do not exist. GitHub can repack even the largest repositories we host in a fraction of the time it used to take. In this post, we’ll talk about what problems we were encountering, the solutions we built, how we deployed them safely, and describe some possible future directions.

All of our work here is being contributed to the open source Git project, and will be available in an upcoming release.

The problem

Why is GitHub’s maintenance job so expensive in the first place? It’s because we chose to have maintenance repack the entire contents of each repository into a single packfile. Doing so is expensive, but having just one packfile carries some benefits, too. With only one packfile, looking up objects doesn’t require opening and searching through multiple packs to find it. It also means that all objects can be compressed as a delta relative to all other objects (Git’s packfile format supports cross-pack deltas, but currently Git will never store them on disk). But, the most important reason is that reachability bitmaps, a performance-critical data structure, are only compatible with a single pack.

A new feature in Git, multi-pack indexes solves the former problem by making all object lookups go through a single index, but didn’t yet solve the latter. So, we set out to fill in the gaps by bringing bitmap support to multi-pack indexes in order to remove the single-pack limitation on reachability bitmaps.

But in order to build multi-pack bitmaps, we had to solve a number of other interesting problems along the way. First, we had to decide how to arrange the objects in a multi-pack index to achieve good bitmap compression. We also had to figure out how to quickly invert that ordering to translate between bit positions back to the objects they refer to. Some of these steps also yielded notable performance improvements on single-pack repositories, too. Finally, we had to figure out a new repacking strategy that scaled with the size of recent pushes, rather than with the size of the entire repository.

But before we get into all of that, let’s start from the very beginning.

Objects, packs, and fetching

You might be aware that Git stores the contents of your repositories as a set of objects. Each object represents an individual piece of your repository: a single file, tree, commit, or tag. These objects may be stored individually as “loose” objects, or together in a packfile.

If you’ve ever heard that a Git repository is “nothing more than a directed acyclic graph,” then you know that these objects can refer to one another. A tree refers to a set of blobs (which correspond to files) and other trees (which correspond to sub-directories). A commit refers to a single tree (the repository root), and zero or more other commits which are its parents.

These links help Git figure out which objects it needs to transfer to fulfill a fetch or clone request. When you fetch a repository from GitHub, Git performs a negotiation to figure out which objects to send. The server advertises the objects at the tip of its references (basically the tips of branches and tags). The client does the same, along with the set of references that it wants from the server. Then, the server walks the links between the requested objects and the objects that the client already has in order to figure out what to send.

Above is an object graph. The client advertises its ref tips (indicated by the darker blue commits). The server’s advertised references are colored dark red. The blue shaded area represents the result of walking along the edges to obtain the reachability closure of the objects the client already has. The red shaded area represents the same closure from the server’s perspective, excluding objects that the client already has. Objects in this region are the ones which need to be sent to the client.

During a fetch, Git needs to send not just the commits in between what the client has and wants, but also all of the objects that are reachable from those commits. Because Git doesn’t store the list of every reachable object, this may be expensive, especially in the case of a clone. When cloning, the client doesn’t have any objects, so it asks the server for all of the objects reachable from any reference.

In an earlier blog post, we talked about how we use reachability bitmaps to accelerate this negotiation. In case you haven’t read that post, below is a quick primer.

Reachability bitmaps

How does Git handle this special case where the client has nothing and wants everything? Ultimately, the server needs to determine the reachability closure of all of the reference tips. In other words, it needs a list of all of the objects at the reference tips, and all of the ancestors of those objects in order to assemble a complete copy of the repository at the other end.

Unfortunately for us, the larger the repository is, the longer it takes Git to compute the list of objects to send. This isn’t feasible even for medium-sized repositories. Git could handle our case specially by just sending every object it has, but that might result in many unwanted objects being sent, too. (For example, GitHub stores the outcome of “test merges” in special refs which aren’t ordinarily sent during fetches, but whose objects are stored in the same object directory nonetheless.)

Instead, Git stores a set of reachability bitmaps corresponding to some of the commits in a packfile. The idea is rather simple: arrange the objects in a pack in some order (the particular order used is something we’ll discuss shortly in detail). Then, the ith bit in the bitmap corresponding to commit C is 1 if C can reach the ith object, and 0 otherwise.

Having a one-to-one correspondence between objects and bit positions has a couple of appealing properties: taking the union of reachable objects between commits is as simple as ORing their bitmaps together, and taking the difference is as simple as combining AND and NOT. So, when a bitmap exists, Git can dramatically speed up the object negotiation phase:

  • First, OR all of the bitmaps corresponding to the reference tips that the client wants. Call this new bitmap W.
  • Then, do the same with the bitmaps corresponding to the reference tips that the client advertised as already having. Call this bitmap H.
  • Finally, compute W AND NOT H to produce the set of objects the client needs (in other words, everything it wants but does not already have). Then, send those objects.

Because all of the reachability information is encoded directly into the bitmaps, Git saves time by avoiding the need to open up and parse objects, allowing it to produce the same result in a fraction of the time.

This idea has been used since at least the 1970s to speed up queries in relational databases. In Git, reachability bitmaps can provide dramatic speed-ups when walking objects that reside in the same pack: walking all of the objects in the Linux kernel repository took more than 33 seconds without bitmaps, but only 1.57 seconds to perform the same traversal with bitmaps.

The object order

How does Git turn a set of objects into a sequence of bit positions? One way you might imagine to do this is assign bit positions in lexicographic order. The first bit corresponds to 000023961a, the second to 0000d6543f, the third to 000182eacf, and so on in lexical order.

Why not do this? Recall that an object’s ID is determined by a SHA-1 of its contents, which means that reachability of any object in this order is only reachable to nearby objects by chance. And reachability of nearby objects matters: Git compresses the bitmaps using EWAH compression, which relies on having long runs of identical bits. If the object order makes reachability look essentially random from bit to bit, Git won’t be able to efficiently compress the bitmaps.

Pack order—that is, the physical arrangement of objects in a packfile—produces a sequence of bit positions that tends to place reachable objects next to each other. And this produces exactly the kind of long runs of identical bits that make EWAH compression perform well.

The problem

But, all of this creates a problem for us: if the order of bit positions is dictated by a pack, then bitmaps are coupled in implementation and in concept to the existence of a single packfile. So, any objects that accumulate outside of the bitmapped pack won’t benefit from the same speed-ups.

To address this, we periodically repack the repository’s entire contents into a single pack, and then generate a new reachability bitmap. This makes reachability queries in more recent parts of the repository’s history faster.

But generating that new pack takes time; in fact, it’s quadratic. Bigger repositories take longer to repack, but also grow at a faster rate, which means they run maintenance more often. This compounding effect sometimes makes it such that some repositories are constantly undergoing maintenance: by the time one maintenance job has finished, another is already sitting in the queue, waiting to be run.

Since the bottleneck for maintenance is the compression of an entire repository’s contents into a single packfile, what would it take to be able to repack the contents into multiple packfiles instead?

Multi-pack indexes

To order a set of objects spanning multiple packs, we looked to a recent Git feature: multi-pack indexes.

When Git wants to locate an object by name in a single pack, it uses that pack’s index (.idx) file, which provides a binary-searchable list of object locations. Multi-pack indexes work similarly to .idx files, but the location they indicate is a pair containing both the pack containing the object, as well as where in that pack the object can be found.

The figure above gives a flavor for what kind of data is organized in a multi-pack index. Here, there are three packs, each with a handful of objects. The multi-pack index stores the location of each unique object among the set of packs. When multiple copies of an object exist (like the green or red objects in packs xyz and abc), ties are broken in favor of the copy in the pack with the earliest modification time.

The order of objects in the multi-pack index differs from the order in each individual pack. Since a pack is free to store objects in any order it wants, the multi-pack index stores objects in lexicographic order so that an object can be found quickly by name using a binary search.

Ordering objects

Given a multi-pack index, how should we order the objects in the packs it contains? We discussed earlier that ordering objects lexicographically results in poor compression. We also noted that objects in packs are ordered topologically, which for our purposes we can consider to imply that individual objects tend to appear near other objects which they reach in this order.

So, any ordering of the objects in a multi-pack index should capture as much of those two properties as possible. With that in mind, we decided on the following order:

  • Objects are first grouped according to which packfile they appear in, and packs are ordered according to the multi-pack index.
  • Objects within the same pack should be ordered according to their locations within that pack.

This effectively concatenates the pack-order of multiple packs together, according to some other ordering defined on the packs themselves.

To see what this looks like, let’s overlay a portion of a bitmap that covers the objects in our earlier example:

The first three bits correspond to the red, yellow, and green objects, respectively. Each one of those objects comes from the xyz pack, which means that the xyz pack has the oldest modification among the three. Scanning the bitmap from left to right, these objects appear in pack order; that is, the byte offset of the red object is less than the byte offsets of the yellow and green objects that follow it.

The purple and blue objects come next, since they are in the pack that follows. But note that the copies of the red and green objects in the abc pack don’t correspond to any bits highlighted. Why? Because the multi-pack index selected the copy of those objects in the earlier pack.

Finally, the orange and pink objects appear, also in pack order. And, as we expect, the copy of the purple object that appears in pack 123 isn’t included in the bitmap, because the copy in pack abc was.

This ordering gives us great locality, but we still need to address how to map bit positions back to the objects they represent. For example, let’s look at the fifth bit position, which we know refers to the blue object: how could Git discover this same fact?

You could reasonably imagine that knowing how many total objects are in each pack would be good enough to figure out which objects each bit corresponds to. But that isn’t enough information; we don’t know how many unique objects selected by the multi-pack index appear in each pack, and we also don’t know which non-unique objects are missing. So it’s not good enough to count past the three bits corresponding to objects in pack xyz and then count two more bits up to the fifth bit, because that would point at the copy of the green object in pack abc.

Reverse indexes

To solve this problem, we introduced reverse indexes. In the same way that the pack index provides a mapping from object name to object location, the reverse index maps an object’s location back to its name.

The idea is simple: in addition to the pack’s contents (stored in a .pack file) and index (stored in an .idx file), we’ll write an array of numbers (which comprise the reverse index, and are stored in a new .rev file). These numbers provide a mapping between an object’s position in pack order and its position in lexicographic order.

To better understand this, let’s take a look at an example on a single pack.

The .idx file (shown in the lower left) lists objects in lexicographic order: the yellow object comes before the red one, and the red one comes before the green one. But their pack order is different: there, the red object comes before the yellow one instead of the other way around. The reverse index helps us unwind the two: it tells us that the red object is in the position 3 in lexicographic order, and the yellow object is in position 1.

The reverse index allows us to map quickly from offsets into the packfile to object positions they correspond to. That allows us to quickly determine the size of a packed object. For example, say that you want to figure out how large the red object is. Because Git doesn’t store that information directly, you have one of two options: either scan linearly through the packed data (inflating its contents until you locate the stream end), or locate the adjacent object (in this case, the yellow one) by name, and measure the difference of their offsets.

Without the reverse index, there is no way to figure out where the adjacent object starts. But with a reverse index, locating the red object is as simple as reading the adjacent entry in the reverse index to discover the offset. Here, that value is 1, which points at an entry in the .idx file, which in turn points at the location in the pack.

Before the on-disk reverse index, Git computed this table on the fly and stored the results in an array in memory. This had a couple of major downsides. To build a reverse index on the fly, Git has to allocate a pair of pack offset and index position. This requires memory and runtime, which both scale relative to the size of the pack. Even though this processes using radix sort, sorting the reverse index entries can be noticeably slow when done once per process.

Some initial testing of these reverse indexes showed that they could enable rather dramatic speed-ups when serving fetches on real-world repositories. To verify our early results, we gradually rolled out reverse indexes on a handful of repositories.

Below is the 50th percentile of CPU time for fetches to Homebrew/homebrew-core on our testing host before and after the change:

In this case, we reduced the amount of time it took to serve any fetch of that repository by around 80%. Here’s another plot from the same time which shows the resident set size of the program used to serve fetches by replicas of that same repository:

Encouraged by our first results in production, we proceeded to cautiously roll out on-disk reverse indexes to all other replicas. Once we completed the roll-out, we let a couple of days pass in order for replicas to have enough time to generate reverse indexes. Then, we sampled the overall CPU time it took to serve fetches across all repositories and observed the drop we were hoping for:

In the graph above, you can see three 24-hour cycles. The first day was before we rolled out .rev files, and the second two peaks are after. The per-day peaks dropped from around 10.8 seconds to 7 seconds, for a collective savings of around 35% on all repositories.

Multi-pack bitmaps

Now that we have a format for .rev files, we can reuse it as the missing piece we need to build multi-pack bitmaps. Instead of writing positions relative to a pack .idx file, a multi-pack reverse index writes positions which are relative to a multi-pack-index file.

This provides exactly the information we need to rediscover the mapping between bit positions and the objects they refer to in the multi-pack index. To see why, let’s take a look at another example:

To figure out that the fifth bit corresponds to the blue object in the above diagram, we read the fifth entry in the multi-pack reverse index and get back that the fifth bit maps to the eleventh object in the multi-pack index. And, sure enough, the 11th object points back to the blue object that we were looking for in pack abc.

Putting it all together, this gives us a bitmap which can refer to objects in multiple packs. Based on its filename, each bitmap knows whether or not it belongs to a pack (and if so, which one) or to a multi-pack index. And based on that information, it can translate its object lookups to be relative to a packfile, or to the multi-pack index via its reverse index.

Since we chose the ordering carefully, these multi-pack bitmaps compress exactly as well as their single-pack counterparts. And they decouple bitmaps from individual packs. So, a repository can still have at most one bitmap, but that bitmap can now correspond to multiple packs.

Geometric repacking

Now that we can include multiple packs in a single bitmap, what’s the best way to repack a repository during maintenance?

With single-pack bitmaps, the only option was to pack everything together into one enormous pack. But now that this restriction no longer exists, we have to figure out the best way to repack the objects. When deciding on a new repacking strategy, we wanted something that struck a balance between two properties:

  • On average, there is a relatively small number of packs in the repository.
  • On average, we create a pack that collects objects that were pushed since the last maintenance run.

We decided on an invariant to ensure that the packs in a repository form a geometric progression by object size. In particular, if you sort the packs by number of objects, with the first pack having the most and the final pack having the least, then each pack will contain at least twice as many objects as the next one.

We taught git repack a new --geometric= mode, which creates exactly this geometric progression. Soon (at the time of writing, these patches are still being submitted and reviewed), you’ll be able to try this yourself on your own repository by running:

$ packsizes() {
    find .git/objects/pack -type f -name '*.pack' |
    while read pack; do
      printf "%7d %s\n" \
        "$(git show-index < ${pack%.pack}.idx | wc -l)" "$pack"
    done | sort -rn
$ packsizes # before
$ git repack --write-midx --write-bitmap-index -d --geometric=2
$ packsizes # after

How does this command work? We select a set of packs and combine their objects into one new pack that replaces the set. But picking this set optimally is NP-hard, so we have to approximate it.

To see how we perform that approximation, let's walk through an example. The first step is to figure out how many packs already form a geometric progression. To do this, imagine ordering packs by how many objects they contain. Then, consider each adjacent pair of packs from largest to smallest. At each step, ask: "is the larger pack at least twice the size of the smaller one?". If so, then every pack from that pair onwards already forms a geometric progression.

In this example, the second and third packs (each containing one object) violate our progression. At that point, we know that the second pack, and any packs below it must be repacked together in order to restore the progression.

But we can't just repack the first two packs together, since the combined pack would still be too big (it would contain two objects, and the third pack only contains one). So we grow the set of packs to combine until the invariant is restored:

Here, we had to combine the first four packs together in order to restore a geometric progression. Those packs together contain 7 objects, which is less than half of the next-largest pack (which contains 32 objects). And we can't combine any more packs, since doing so would violate the remainder of the progression.

At this point, we can write a new pack containing just the objects in the set of packs we combined in the previous step. After discarding the now-redundant packs, the remaining packs again form a geometric progression:

This means that the number of packs grow logarithmically over time, so a repository will never have too many packs at any one point in time. It also has the appealing property that older objects tend to get pushed into the larger packs, meaning they get repacked less frequently as time passes. An important corollary of this is that each repack tends to focus on the most recently pushed objects. In other words, this strategy tends to make repacking take time proportional to the number of new objects, not the number of overall objects.

In order to keep the repository performing well over many geometric repacks, we intersperse an all-into-one repack once for every eight geometric repacks.

Importantly, this means that even though the classically-slow repacks are still slow, we aren't forced to run them every time we want to repack a repository.

Deploying to production

Now that we have a way to write a bitmap that covers the objects in multiple packs, a way to quickly map between bit positions and the objects they refer to in a multi-pack index, and a repacking strategy which only requires modifying recent additions to the repository, we are ready to put everything together.

Before rolling out a change as large as this one, we performed extensive local testing to ensure that writing bitmaps worked correctly, and that our new repacking strategy wasn't silently corrupting repositories. But our local testing can only go so far: there are endless corner cases when serving fetches and clones, so the real test occurs only after putting real traffic in front of these new paths. Our goal was to design a deployment strategy that simultaneously exercised enough of those cases, while also ensuring that any potential corruption could never occur on a majority of repository replicas.

Our first test cases were internal repositories. We broke our original tests into two phases. In the first phase, we wrote "multi-pack bitmaps" which contained only a single pack. This allowed us to exercise the most basic case of multi-pack bitmaps (having only one pack) without running our new repack code. Once we had built confidence in that approach, we expanded our test to alternate between geometric and full repacks.

After a couple of weeks without any issues, we were confident enough in our change to start testing it on other repositories external to GitHub. We selected first an individual host, and then a whole rack of hosts on which every repack would alternate between geometric and full. At this stage, results were encouraging: the average time to repack had dropped significantly, as had the overall amount of time spent repacking.

By this stage, our roll-out proceeded for several weeks in only one of three data centers. Because we never place a majority of repository replicas in any single datacenter, this configuration made it impossible for our changes to corrupt a majority set of replicas in an unrecoverable fashion, while still putting our changes in the request path for a large amount of traffic.

Finally, after adopting this configuration for a week, we proceeded to enroll percentages of replicas hosted in other datacenters to also use multi-pack bitmaps until all repositories were using multi-pack reachability bitmaps.

After rolling out our new repacking strategy with multi-pack bitmaps, we saved on average 5.67 CPU days every hour compared to the old strategy.

Likewise, the average time spent repacking any single repository also dropped considerably. Below, the plot is broken out per-site, and you can see when we began testing in a single site, as well as when we expanded our deployment to all sites.

There, the average dropped from around 1 minute to repack a repository to just 15 seconds.

Future directions

There are two major open areas we're considering for the future that will make it possible for further performance improvements:

One open area is the bitmap computation itself. Git's bitmap generation code can reuse existing bitmaps by permuting their bits into a new order, but this operation can still take time proportional to the size of the repository. One way to solve this would be to write bitmaps incrementally, only walking the new objects introduced since the last time a bitmap was written. This problem is tricky because it requires not only that the bitmap file be able to be written incrementally, but also that the object ordering we select is stable: that is, that introducing new bitmaps won't render the existing ones meaningless.

Another open area is the pack structure. Creating packs that form a geometric sequence is a promising step forward that allows us to trade off between full and partial repacks. But some repositories are so large that repacking the whole repository isn't feasible, much less desirable. Designing a strategy which freezes the packs containing the oldest objects in a repository's history will allow us to grow to support even larger repositories in the future.

Thank you

This project would not have been possible without help from the upstream Git community, as well as many engineering teams within GitHub. Each of these changes required extensive review, both to the Git project, as well as to internal GitHub services. Special thanks to Jeff King, Derrick Stolee, Jonathan Tan, Junio Hamano, and others for making this possible.

Scalable agile development practices based on AWS CodeCommit

Post Syndicated from Mengxin Zhu original https://aws.amazon.com/blogs/devops/scalable-agile-development-practices-based-on-aws-codecommit/

Development teams use agile development processes based on Git services extensively. AWS provides AWS CodeCommit, a managed, Git protocol-based, secure, and highly available code service. The capabilities of CodeCommit combined with other developer tools, like AWS CodeBuild and AWS CodePipeline, make it easy to manage collaborative, scalable development process with fine-grained permissions and on-demand resources.

You can manage user roles with different AWS Identity and Access Management (IAM) policies in the code repository of CodeCommit. You can build your collaborative development process with pull requests and approval rules. The process described in this post only requires you to manage the developers’ role, without forking the source repository for individual developers. CodeCommit pull requests can integrate numerous code analysis services as approvers to improve code quality and mitigate security vulnerabilities, such as SonarQube static scanning and the ML-based code analysis service Amazon CodeGuru Reviewer.

The CodeCommit-based agile development process described in this post has the following characteristics:

  • Control permissions of the CodeCommit repository via IAM.
    • Any code repository has at least two user roles:
      • Development collaborator – Participates in the development of the project.
      • Repository owner – Has code review permission and partial management permissions of the repository. The repository owner is also the collaborator of the repository.
    • Both development collaborator and owner have read permissions of the repository and can pull code to local disk via the Git-supported protocols.
    • The development collaborator can push new code to branches with a specific prefix, for example, features/ or bugs/. Multiple collaborators can work on a particular branch for one pull request. Collaborators can create new pull requests to request merging code into the main branch, such as the mainline branch.
    • The repository owner has permission to review pull requests with approval voting and merge pull requests.
    • Directly pushing code to the main branch of repository is denied.
  • Development workflow. This includes the following:
    • Creating an approval template rule of CodeCommit that requires at least two approvals from the sanity checking build of the pull request and repository owner. The workflow also applies the approval rule to require mandatory approvals for pull requests of the repository.
    • The creation and update of source branch events of pull requests via Amazon EventBridge triggers a sanity checking build of CodeBuild to compile, test, and analyze the pull request code. If all checks pass, the pull request gets an approval voting from the sanity checking build.
    • Watching the main branch of the repository triggers a continuous integration for any commit. You can continuously publish artifacts of your project to the artifact repository or integrate the latest version of the service to your business system.

This agile development process can use AWS CloudFormation and AWS Cloud Development Kit (AWS CDK) to orchestrate AWS resources with the best practice of infrastructure as code. You can manage hundreds of repositories in your organization and automatically provision new repositories and related DevOps resources from AWS after the pull request of your IaC as a new application is approved. This makes sure that you’re managing the code repository and DevOps resources in a secure and compliant way. You can use it as a reference solution for your organization to manage large-scale R&D resources.

Solution overview

In the following use case, you’re working on a Java-based project AWS Toolkit for JetBrains. This application has developers that can submit code via pull requests. Each pull request is automatically checked and validated by CodeBuild builds. The owners of the project can review the pull request and merge it to the main branch. The code submitted to the main branch triggers the continuous integration to build the project artifacts.

The following diagram illustrates the components built in this post and their role in the DevOps process.

architecture diagram


For this walkthrough, you should meet the following prerequisites:

Preparing the code

Clone the sample code from the Github repo with your preferred Git client or IDE and view branch aws-toolkit-jetbrains, or download the sample code directly and unzip it into an empty folder.

Initializing the environment

Open the terminal or command prompt of your operating system, enter the directory where the sample code is located, enter the following code to initialize the environment, and install the dependency packages:

npm run init

Deploying application

After successfully initializing the AWS CDK environment and installing the dependencies of the sample application, enter the following code to deploy the application:

npm run deploy

Because the application creates the IAM roles and policies, AWS CDK requires you to confirm security-related changes before deploying it. You see the following outputs from the command line.

deploy stack

Enter y to confirm the security changes, and AWS CDK begins to deploy the application. After a few minutes, you see output similar to the following code, indicating that the application stack has been successfully deployed in your AWS account:

✅  CodecommitDevopsModelStack

CodecommitDevopsModelStack.Repo1AdminRoleOutput = arn:aws:iam::012345678912:role/codecommitmodel/CodecommitDevopsModelStack-Repo1AdminRole0648F018-OQGKZPM6T0HP
CodecommitDevopsModelStack.Repo1CollaboratorRoleOutput = arn:aws:iam::012345678912:role/codecommitmodel/CodecommitDevopsModelStac-Repo1CollaboratorRole1EB-15KURO7Z9VNOY

Stack ARN:

As shown in the preceding code, the output of successful deployment indicates that the ARN of two IAM roles were created on behalf of the owner and development collaborator of the source code repository.

Checking deployment results

After successfully deploying the app, you can sign in to the CodeCommit console and browse repositories. The following screenshot shows three repositories.

created repos

For this post, we use three repositories to demonstrate configuring the different access permissions for different teams in your organization. As shown in the following screenshot, the repository CodeCommitDevopsModelStack-MyApp1 is tagged to grant permissions to the specific team abc.

repository tags

The IAM roles for the owner and development collaborator only have access to the code repository with the following tags combination:

 'app': 'my-app-1',
 'team': 'abc',

Configuring CodeCommit repository access on behalf of owner and collaborator

Next, you configure the current user to simulate the owner and development collaborator via IAM’s AssumeRole.

Edit the AWS CLI profile file with your preferred text editor and add the following configuration lines:

[profile codecommit-repo1-owner]

role_arn = <the ARN of owner role after successfully deploying sample app>

source_profile = default

region = ap-southeast-1


[profile codecommit-repo1-collaborator]

role_arn = <the ARN of collaborator role after successfully deploying sample app>

source_profile = default

region = ap-southeast-1


Replace the role_arn in the owner and collaborator sections with the corresponding output after successfully deploying the sample app.

If the AWS CLI isn’t using the default profile, replace the value of source_profile with the profile name you’re currently using.

Make the region consistent with the value configured in source_profile. For example, this post uses ap-southeast-1.

After saving the modification of the profile, you can test this configuration from the command line. See the following code:

export AWS_DEFAULT_PROFILE=codecommit-repo1-owner # assume owner role of repository

aws sts get-caller-identity # get current user identity, you should see output like below,
    "UserId": "AROAQP3VLCVWYYTPJL2GW:botocore-session-1587717914",
    "Account": "0123456789xx",
    "Arn": "arn:aws:sts::0123456789xx:assumed-role/CodecommitDevopsModelStack-Repo1AdminRole0648F018-1SNXR23P4XVYZ/botocore-session-1587717914"

aws codecommit list-repositories # list of all repositories of AWS CodeCommit in configured region
    "repositories": [
            "repositoryName": "CodecommitDevopsModelStack-MyApp1",
            "repositoryId": "208dd6d1-ade4-4633-a2a3-fe1a9a8f3d1c "
            "repositoryName": "CodecommitDevopsModelStack-MyApp2",
            "repositoryId": "44421652-d12e-413e-85e3-e0db894ab018"
            "repositoryName": "CodecommitDevopsModelStack-MyApp3",
            "repositoryId": "8d146b34-f659-4b17-98d8-85ebaa07283c"

aws codecommit get-repository --repository-name CodecommitDevopsModelStack-MyApp1 # get detail information of repository name ends with MyApp1
    "repositoryMetadata": {
        "accountId": "0123456789xx",
        "repositoryId": "208dd6d1-ade4-4633-a2a3-fe1a9a8f3d1c",
        "repositoryName": "CodecommitDevopsModelStack-MyApp1",
        "repositoryDescription": "Repo for App1.",
        "lastModifiedDate": "2020-06-24T00:06:24.734000+08:00",
        "creationDate": "2020-06-24T00:06:24.734000+08:00",
        "cloneUrlHttp": "https://git-codecommit.ap-southeast-1.amazonaws.com/v1/repos/CodecommitDevopsModelStack-MyApp1",
        "cloneUrlSsh": "ssh://git-codecommit.ap-southeast-1.amazonaws.com/v1/repos/CodecommitDevopsModelStack-MyApp1",
        "Arn": "arn:aws:codecommit:ap-southeast-1:0123456789xx:CodecommitDevopsModelStack-MyApp1"

aws codecommit get-repository --repository-name CodecommitDevopsModelStack-MyApp2 # try to get detail information of repository MyApp2 that does not have accessing permission by the role

An error occurred (AccessDeniedException) when calling the GetRepository operation: User: arn:aws:sts::0123456789xx:assumed-role/CodecommitDevopsModelStack-Repo1AdminRole0648F018-OQGKZPM6T0HP/botocore-session-1593325146 is not authorized to perform: codecommit:GetRepository on resource: arn:aws:codecommit:ap-southeast-1:0123456789xx:CodecommitDevopsModelStack-MyApp2

You can also grant IAM policies starting with CodecommitDevopsmodelStack-CodecommitCollaborationModel to existing IAM users for the corresponding owner or collaborator permissions.

Initializing the repository

The new code repository CodecommitdevopsmodelStack-MyApp1 is an empty Git repository without any commit. You can use the AWS Toolkit for JetBrains project as the existing local codebase and push the code to the repository hosted by CodeCommit.

Enter the following code from the command line:

export AWS_DEFAULT_PROFILE=codecommit-repo1-owner # assume owner role of repository

git clone https://github.com/aws/aws-toolkit-jetbrains.git # clone aws-toolkit-jetbrains to local as existing codebase

cd aws-toolkit-jetbrains

git remote add codecommit codecommit::ap-southeast-1://CodecommitDevopsModelStack-MyApp1 # add CodeCommit hosted repo as new remote named as codecommit. Follow the doc set up AWS CodeCommit with git-remote-codecommit, or use remote url of repository via https/ssh protocol

git push codecommit master:init  # push existing codebase to a temporary branch named 'init'

aws codecommit create-branch --repository-name CodecommitDevopsModelStack-MyApp1 --branch-name master --commit-id `git rev-parse master` # create new branch 'master'

aws codecommit update-default-branch --repository-name CodecommitDevopsModelStack-MyApp1 --default-branch-name master # set branch 'master' as main branch of repository

aws codecommit delete-branch --repository-name CodecommitDevopsModelStack-MyApp1 --branch-name init # clean up 'init' branch

Agile development practices

For this use case, you act as the collaborator of the repository implementing a new feature for aws-toolkit-jetbrains, then follow the development process to submit your code changes to the main branch.

Enter the following code from the command line:

export AWS_DEFAULT_PROFILE=codecommit-repo1-collaborator # assume collaborator role of repository

# add/modify/delete source files for your new feature

git commit -m 'This is my new feature.' -a

git push codecommit HEAD:refs/heads/features/my-feature # push code to new branch with prefix /features/

aws codecommit create-pull-request --title 'My feature "Short Description".' --description 'Detail description of feature request'  --targets repositoryName=CodecommitDevopsModelStack-MyApp1,sourceReference=features/my-feature,destinationReference=master # create pull request for new feature

The preceding code submits the changes of the new feature to a branch with the prefix features/ and creates a pull request to merge the change into the main branch.

On the CodeCommit console, you can see that a pull request called My feature "Short Description". created by the development collaborator has passed the sanity checking build of the pull request and gets an approval voting (it takes about 15 minutes to complete the checking build in this project).

PR build result


The owner of the repository also needs to review the pull request with one approval at least, then they can merge the repository to the main branch. The pull request on the CodeCommit console supports several code review features, such as change comparison, in-line comments, and code discussions. For more information, see Using AWS CodeCommit Pull Requests to request code reviews and discuss code. The following screenshot shows the review tool on the CodeCommit console, on the Changes tab.

CodeReview Tool


The following screenshot shows the approval details of the pull request, on the Approvals tab.

Approvals tab

When browsing the continuous integration deployment project after merging the pull request, you can see that a new continuous integration build has been triggered by the event of merging the pull request to the main branch.

Deployment build

Cleaning up

When you’re finished exploring this use case and discovering the deployed resources, the last step is to clean up your account. The following code deletes all the resources you created:

npm run cleanup


This post discussed agile development practices based on CodeCommit, including implementation mechanisms and practice processes, and demonstrated how to collaborate in development under those processes. AWS powers the code that manages the code repository itself and the DevOps processes built around it in the example application. You can use the IaC capability of AWS and apply those practices in your organization to build compliant and secure R&D processes.

Security updates for Monday

Post Syndicated from ris original https://lwn.net/Articles/756489/rss

Security updates have been issued by CentOS (procps, xmlrpc, and xmlrpc3), Debian (batik, prosody, redmine, wireshark, and zookeeper), Fedora (jasper, kernel, poppler, and xmlrpc), Mageia (git and wireshark), Red Hat (rh-java-common-xmlrpc), Slackware (git), SUSE (bzr, dpdk-thunderxdpdk, and ocaml), and Ubuntu (exempi).

Microsoft acquires GitHub

Post Syndicated from corbet original https://lwn.net/Articles/756443/rss

Here’s the
press release
announcing Microsoft’s agreement to acquire GitHub for a
mere $7.5 billion. “GitHub will retain its developer-first
ethos and will operate independently to provide an open platform for all
developers in all industries. Developers will continue to be able to use
the programming languages, tools and operating systems of their choice for
their projects — and will still be able to deploy their code to any
operating system, any cloud and any device.

Kernel 4.17 released

Post Syndicated from corbet original https://lwn.net/Articles/756373/rss

Linus has released the 4.17 kernel, which
will indeed be called “4.17”.
No, I didn’t call it 5.0, even though all the git object count
numerology was in place for that. It will happen in the not _too_
distant future, and I’m told all the release scripts on kernel.org are
ready for it, but I didn’t feel there was any real reason for it.

Headline features in this release include
improved load estimation in the CPU
BPF tracepoints
lazytime support in the XFS filesystem,
full in-kernel TLS protocol support,
histogram triggers for tracing,
mitigations for the latest Spectre variants,
and, of course, the removal of support for eight unloved processor

Storing Encrypted Credentials In Git

Post Syndicated from Bozho original https://techblog.bozho.net/storing-encrypted-credentials-in-git/

We all know that we should not commit any passwords or keys to the repo with our code (no matter if public or private). Yet, thousands of production passwords can be found on GitHub (and probably thousands more in internal company repositories). Some have tried to fix that by removing the passwords (once they learned it’s not a good idea to store them publicly), but passwords have remained in the git history.

Knowing what not to do is the first and very important step. But how do we store production credentials. Database credentials, system secrets (e.g. for HMACs), access keys for 3rd party services like payment providers or social networks. There doesn’t seem to be an agreed upon solution.

I’ve previously argued with the 12-factor app recommendation to use environment variables – if you have a few that might be okay, but when the number of variables grow (as in any real application), it becomes impractical. And you can set environment variables via a bash script, but you’d have to store it somewhere. And in fact, even separate environment variables should be stored somewhere.

This somewhere could be a local directory (risky), a shared storage, e.g. FTP or S3 bucket with limited access, or a separate git repository. I think I prefer the git repository as it allows versioning (Note: S3 also does, but is provider-specific). So you can store all your environment-specific properties files with all their credentials and environment-specific configurations in a git repo with limited access (only Ops people). And that’s not bad, as long as it’s not the same repo as the source code.

Such a repo would look like this:

└─── production
|   |   application.properites
|   |   keystore.jks
└─── staging
|   |   application.properites
|   |   keystore.jks
└─── on-premise-client1
|   |   application.properites
|   |   keystore.jks
└─── on-premise-client2
|   |   application.properites
|   |   keystore.jks

Since many companies are using GitHub or BitBucket for their repositories, storing production credentials on a public provider may still be risky. That’s why it’s a good idea to encrypt the files in the repository. A good way to do it is via git-crypt. It is “transparent” encryption because it supports diff and encryption and decryption on the fly. Once you set it up, you continue working with the repo as if it’s not encrypted. There’s even a fork that works on Windows.

You simply run git-crypt init (after you’ve put the git-crypt binary on your OS Path), which generates a key. Then you specify your .gitattributes, e.g. like that:

secretfile filter=git-crypt diff=git-crypt
*.key filter=git-crypt diff=git-crypt
*.properties filter=git-crypt diff=git-crypt
*.jks filter=git-crypt diff=git-crypt

And you’re done. Well, almost. If this is a fresh repo, everything is good. If it is an existing repo, you’d have to clean up your history which contains the unencrypted files. Following these steps will get you there, with one addition – before calling git commit, you should call git-crypt status -f so that the existing files are actually encrypted.

You’re almost done. We should somehow share and backup the keys. For the sharing part, it’s not a big issue to have a team of 2-3 Ops people share the same key, but you could also use the GPG option of git-crypt (as documented in the README). What’s left is to backup your secret key (that’s generated in the .git/git-crypt directory). You can store it (password-protected) in some other storage, be it a company shared folder, Dropbox/Google Drive, or even your email. Just make sure your computer is not the only place where it’s present and that it’s protected. I don’t think key rotation is necessary, but you can devise some rotation procedure.

git-crypt authors claim to shine when it comes to encrypting just a few files in an otherwise public repo. And recommend looking at git-remote-gcrypt. But as often there are non-sensitive parts of environment-specific configurations, you may not want to encrypt everything. And I think it’s perfectly fine to use git-crypt even in a separate repo scenario. And even though encryption is an okay approach to protect credentials in your source code repo, it’s still not necessarily a good idea to have the environment configurations in the same repo. Especially given that different people/teams manage these credentials. Even in small companies, maybe not all members have production access.

The outstanding questions in this case is – how do you sync the properties with code changes. Sometimes the code adds new properties that should be reflected in the environment configurations. There are two scenarios here – first, properties that could vary across environments, but can have default values (e.g. scheduled job periods), and second, properties that require explicit configuration (e.g. database credentials). The former can have the default values bundled in the code repo and therefore in the release artifact, allowing external files to override them. The latter should be announced to the people who do the deployment so that they can set the proper values.

The whole process of having versioned environment-speific configurations is actually quite simple and logical, even with the encryption added to the picture. And I think it’s a good security practice we should try to follow.

The post Storing Encrypted Credentials In Git appeared first on Bozho's tech blog.

Security updates for Friday

Post Syndicated from ris original https://lwn.net/Articles/756260/rss

Security updates have been issued by Debian (kernel, procps, and tiff), Fedora (ca-certificates, chromium, and git), Mageia (kernel, kernel-linus, kernel-tmb, and libvirt), openSUSE (chromium and xen), Oracle (procps, xmlrpc, and xmlrpc3), Red Hat (xmlrpc and xmlrpc3), Scientific Linux (procps, xmlrpc, and xmlrpc3), SUSE (HA kernel modules and kernel), and Ubuntu (libytnef and python-oslo.middleware).

Amazon SageMaker Updates – Tokyo Region, CloudFormation, Chainer, and GreenGrass ML

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/sagemaker-tokyo-summit-2018/

Today, at the AWS Summit in Tokyo we announced a number of updates and new features for Amazon SageMaker. Starting today, SageMaker is available in Asia Pacific (Tokyo)! SageMaker also now supports CloudFormation. A new machine learning framework, Chainer, is now available in the SageMaker Python SDK, in addition to MXNet and Tensorflow. Finally, support for running Chainer models on several devices was added to AWS Greengrass Machine Learning.

Amazon SageMaker Chainer Estimator

Chainer is a popular, flexible, and intuitive deep learning framework. Chainer networks work on a “Define-by-Run” scheme, where the network topology is defined dynamically via forward computation. This is in contrast to many other frameworks which work on a “Define-and-Run” scheme where the topology of the network is defined separately from the data. A lot of developers enjoy the Chainer scheme since it allows them to write their networks with native python constructs and tools.

Luckily, using Chainer with SageMaker is just as easy as using a TensorFlow or MXNet estimator. In fact, it might even be a bit easier since it’s likely you can take your existing scripts and use them to train on SageMaker with very few modifications. With TensorFlow or MXNet users have to implement a train function with a particular signature. With Chainer your scripts can be a little bit more portable as you can simply read from a few environment variables like SM_MODEL_DIR, SM_NUM_GPUS, and others. We can wrap our existing script in a if __name__ == '__main__': guard and invoke it locally or on sagemaker.

import argparse
import os

if __name__ =='__main__':

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--learning-rate', type=float, default=0.05)

    # Data, model, and output directories
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])

    args, _ = parser.parse_known_args()

    # ... load from args.train and args.test, train a model, write model to args.model_dir.

Then, we can run that script locally or use the SageMaker Python SDK to launch it on some GPU instances in SageMaker. The hyperparameters will get passed in to the script as CLI commands and the environment variables above will be autopopulated. When we call fit the input channels we pass will be populated in the SM_CHANNEL_* environment variables.

from sagemaker.chainer.estimator import Chainer
# Create my estimator
chainer_estimator = Chainer(
    hyperparameters={'epochs': 10, 'batch-size': 64}
# Train my estimator
chainer_estimator.fit({'train': train_input, 'test': test_input})

# Deploy my estimator to a SageMaker Endpoint and get a Predictor
predictor = chainer_estimator.deploy(

Now, instead of bringing your own docker container for training and hosting with Chainer, you can just maintain your script. You can see the full sagemaker-chainer-containers on github. One of my favorite features of the new container is built-in chainermn for easy multi-node distribution of your chainer training jobs.

There’s a lot more documentation and information available in both the README and the example notebooks.

AWS GreenGrass ML with Chainer

AWS GreenGrass ML now includes a pre-built Chainer package for all devices powered by Intel Atom, NVIDIA Jetson, TX2, and Raspberry Pi. So, now GreenGrass ML provides pre-built packages for TensorFlow, Apache MXNet, and Chainer! You can train your models on SageMaker then easily deploy it to any GreenGrass-enabled device using GreenGrass ML.


I want to give a quick shout out to all of our wonderful and inspirational friends in the JAWS UG who attended the AWS Summit in Tokyo today. I’ve very much enjoyed seeing your pictures of the summit. Thanks for making Japan an amazing place for AWS developers! I can’t wait to visit again and meet with all of you.


MagPi 70: Home automation with Raspberry Pi

Post Syndicated from Rob Zwetsloot original https://www.raspberrypi.org/blog/magpi-70-home-automation/

Hey folks, Rob here! It’s the last Thursday of the month, and that means it’s time for a brand-new The MagPi. Issue 70 is all about home automation using your favourite microcomputer, the Raspberry Pi.

Cover of The MagPi 70 — Raspberry Pi home automation and tech upcycling

Home automation in this month’s The MagPi!

Raspberry Pi home automation

We think home automation is an excellent use of the Raspberry Pi, hiding it around your house and letting it power your lights and doorbells and…fish tanks? We show you how to do all of that, and give you some excellent tips on how to add even more automation to your home in our ten-page cover feature.

Upcycle your life

Our other big feature this issue covers upcycling, the hot trend of taking old electronics and making them better than new with some custom code and a tactically placed Raspberry Pi. For this feature, we had a chat with Martin Mander, upcycler extraordinaire, to find out his top tips for hacking your old hardware.

Article on upcycling in The MagPi 70 — Raspberry Pi home automation and tech upcycling

Upcycling is a lot of fun

But wait, there’s more!

If for some reason you want even more content, you’re in luck! We have some fun tutorials for you to try, like creating a theremin and turning a Babbage into an IoT nanny cam. We also continue our quest to make a video game in C++. Our project showcase is headlined by the Teslonda on page 28, a Honda/Tesla car hybrid that is just wonderful.

Diddyborg V2 review in The MagPi 70 — Raspberry Pi home automation and tech upcycling

We review PiBorg’s latest robot

All this comes with our definitive reviews and the community section where we celebrate you, our amazing community! You’re all good beans

Teslonda article in The MagPi 70 — Raspberry Pi home automation and tech upcycling

An amazing, and practical, Raspberry Pi project

Get The MagPi 70

Issue 70 is available today from WHSmith, Tesco, Sainsbury’s, and Asda. If you live in the US, head over to your local Barnes & Noble or Micro Center in the next few days for a print copy. You can also get the new issue online from our store, or digitally via our Android and iOS apps. And don’t forget, there’s always the free PDF as well.

New subscription offer!

Want to support the Raspberry Pi Foundation and the magazine? We’ve launched a new way to subscribe to the print version of The MagPi: you can now take out a monthly £4 subscription to the magazine, effectively creating a rolling pre-order system that saves you money on each issue.

The MagPi subscription offer — Raspberry Pi home automation and tech upcycling

You can also take out a twelve-month print subscription and get a Pi Zero W plus case and adapter cables absolutely free! This offer does not currently have an end date.

That’s it for today! See you next month.

Animated GIF: a door slides open and Captain Picard emerges hesitantly

The post MagPi 70: Home automation with Raspberry Pi appeared first on Raspberry Pi.

Hiring a Director of Sales

Post Syndicated from Yev original https://www.backblaze.com/blog/hiring-a-director-of-sales/

Backblaze is hiring a Director of Sales. This is a critical role for Backblaze as we continue to grow the team. We need a strong leader who has experience in scaling a sales team and who has an excellent track record for exceeding goals by selling Software as a Service (SaaS) solutions. In addition, this leader will need to be highly motivated, as well as able to create and develop a highly-motivated, success oriented sales team that has fun and enjoys what they do.

The History of Backblaze from our CEO
In 2007, after a friend’s computer crash caused her some suffering, we realized that with every photo, video, song, and document going digital, everyone would eventually lose all of their information. Five of us quit our jobs to start a company with the goal of making it easy for people to back up their data.

Like many startups, for a while we worked out of a co-founder’s one-bedroom apartment. Unlike most startups, we made an explicit agreement not to raise funding during the first year. We would then touch base every six months and decide whether to raise or not. We wanted to focus on building the company and the product, not on pitching and slide decks. And critically, we wanted to build a culture that understood money comes from customers, not the magical VC giving tree. Over the course of 5 years we built a profitable, multi-million dollar revenue business — and only then did we raise a VC round.

Fast forward 10 years later and our world looks quite different. You’ll have some fantastic assets to work with:

  • A brand millions recognize for openness, ease-of-use, and affordability.
  • A computer backup service that stores over 500 petabytes of data, has recovered over 30 billion files for hundreds of thousands of paying customers — most of whom self-identify as being the people that find and recommend technology products to their friends.
  • Our B2 service that provides the lowest cost cloud storage on the planet at 1/4th the price Amazon, Google or Microsoft charges. While being a newer product on the market, it already has over 100,000 IT and developers signed up as well as an ecosystem building up around it.
  • A growing, profitable and cash-flow positive company.
  • And last, but most definitely not least: a great sales team.

You might be saying, “sounds like you’ve got this under control — why do you need me?” Don’t be misled. We need you. Here’s why:

  • We have a great team, but we are in the process of expanding and we need to develop a structure that will easily scale and provide the most success to drive revenue.
  • We just launched our outbound sales efforts and we need someone to help develop that into a fully successful program that’s building a strong pipeline and closing business.
  • We need someone to work with the marketing department and figure out how to generate more inbound opportunities that the sales team can follow up on and close.
  • We need someone who will work closely in developing the skills of our current sales team and build a path for career growth and advancement.
  • We want someone to manage our Customer Success program.

So that’s a bit about us. What are we looking for in you?

Experience: As a sales leader, you will strategically build and drive the territory’s sales pipeline by assembling and leading a skilled team of sales professionals. This leader should be familiar with generating, developing and closing software subscription (SaaS) opportunities. We are looking for a self-starter who can manage a team and make an immediate impact of selling our Backup and Cloud Storage solutions. In this role, the sales leader will work closely with the VP of Sales, marketing staff, and service staff to develop and implement specific strategic plans to achieve and exceed revenue targets, including new business acquisition as well as build out our customer success program.

Leadership: We have an experienced team who’s brought us to where we are today. You need to have the people and management skills to get them excited about working with you. You need to be a strong leader and compassionate about developing and supporting your team.

Data driven and creative: The data has to show something makes sense before we scale it up. However, without creativity, it’s easy to say “the data shows it’s impossible” or to find a local maximum. Whether it’s deciding how to scale the team, figuring out what our outbound sales efforts should look like or putting a plan in place to develop the team for career growth, we’ve seen a bit of creativity get us places a few extra dollars couldn’t.

Jive with our culture: Strong leaders affect culture and the person we hire for this role may well shape, not only fit into, ours. But to shape the culture you have to be accepted by the organism, which means a certain set of shared values. We default to openness with our team, our customers, and everyone if possible. We love initiative — without arrogance or dictatorship. We work to create a place people enjoy showing up to work. That doesn’t mean ping pong tables and foosball (though we do try to have perks & fun), but it means people are friendly, non-political, working to build a good service but also a good place to work.

Do the work: Ideas and strategy are critical, but good execution makes them happen. We’re looking for someone who can help the team execute both from the perspective of being capable of guiding and organizing, but also someone who is hands-on themselves.

Additional Responsibilities needed for this role:

  • Recruit, coach, mentor, manage and lead a team of sales professionals to achieve yearly sales targets. This includes closing new business and expanding upon existing clientele.
  • Expand the customer success program to provide the best customer experience possible resulting in upsell opportunities and a high retention rate.
  • Develop effective sales strategies and deliver compelling product demonstrations and sales pitches.
  • Acquire and develop the appropriate sales tools to make the team efficient in their daily work flow.
  • Apply a thorough understanding of the marketplace, industry trends, funding developments, and products to all management activities and strategic sales decisions.
  • Ensure that sales department operations function smoothly, with the goal of facilitating sales and/or closings; operational responsibilities include accurate pipeline reporting and sales forecasts.
  • This position will report directly to the VP of Sales and will be staffed in our headquarters in San Mateo, CA.


  • 7 – 10+ years of successful sales leadership experience as measured by sales performance against goals.
    Experience in developing skill sets and providing career growth and opportunities through advancement of team members.
  • Background in selling SaaS technologies with a strong track record of success.
  • Strong presentation and communication skills.
  • Must be able to travel occasionally nationwide.
  • BA/BS degree required

Think you want to join us on this adventure?
Send an email to jobscontact@backblaze.com with the subject “Director of Sales.” (Recruiters and agencies, please don’t email us.) Include a resume and answer these two questions:

  1. How would you approach evaluating the current sales team and what is your process for developing a growth strategy to scale the team?
  2. What are the goals you would set for yourself in the 3 month and 1-year timeframes?

Thank you for taking the time to read this and I hope that this sounds like the opportunity for which you’ve been waiting.

Backblaze is an Equal Opportunity Employer.

The post Hiring a Director of Sales appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.