Highlights from Git 2.39

Post Syndicated from Taylor Blau original https://github.blog/2022-12-12-highlights-from-git-2-39/

The open source Git project just released Git 2.39, with features and bug fixes from over 86 contributors, 31 of them new. We last caught up with you on the latest in Git back when 2.38 was released.

To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

If you use Git on the command-line, you have almost certainly used git log to peruse your project’s history. But you may not be as familiar with its cousin, git shortlog.

git shortlog is used to summarize the output produced by git log. For example, many projects (including Git¹) use git shortlog -ns to produce a list of unique contributors in a release, along with the number of commits they authored, like this:

$ git shortlog -ns v2.38.0.. | head -10
   166  Junio C Hamano
   118  Taylor Blau
   115  Ævar Arnfjörð Bjarmason
    43  Jeff King
    26  Phillip Wood
    21  René Scharfe
    15  Derrick Stolee
    11  Johannes Schindelin
     9  Eric Sunshine
     9  Jeff Hostetler
  [...]

We’ve talked about git shortlog in the past, most recently when 2.29 was released to show off its more flexible --group option, which allows you to group commits by fields other than their author or committer. For example, something like:

$ git shortlog -ns --group=author --group=trailer:co-authored-by

would count each commit to its author as well as any individuals in the Co-authored-by trailer.

This release, git shortlog became even more flexible by learning how to aggregate commits based on arbitrary formatting specifiers, like the ones mentioned in the pretty formats section of Git’s documentation.

One neat use is being able to get a view of how many commits were committed each month during a release cycle. Before, you might have written something like this monstrosity:

$ git log v2.38.0.. --date='format:%Y-%m' --format='%cd' | sort | uniq -c

There, --date='format:%Y-%m' tells Git to output each date field like YYYY-MM, and --format='%cd' tells Git to output only the committer date (using the aforementioned format) when printing each commit. Then, we sort the output, and count the number of unique values.

Now, you can ask Git to do all of that for you, by writing:

$ git shortlog v2.38.0.. --date='format:%Y-%m' --group='%cd' -s
     2  2022-08
    47  2022-09
   405  2022-10
   194  2022-11
     5  2022-12

Where -s tells git shortlog output a summary where the left-hand column is the number of commits attributed to each unique group (in this case, the year and month combo), and the right-hand column is the identity of each group itself.

Since you can pass any format specifier to the --group option, the flexibility here is limited only by the pretty formats available, and your own creativity.

[source]

Returning readers may remember our discussion on Git’s new object pruning mechanism, cruft packs. In case you’re new around here, no problem: here’s a refresher.

When you want to tell Git to remove unreachable objects (those which can’t be found by walking along the history of any branch or tag), you might run something like:

$ git gc --cruft --prune=5.minutes.ago

That instructs Git to divvy your repository’s objects into two packs: one containing reachable objects, and another² containing unreachable objects modified within the last five minutes. This makes sure that a git gc process doesn’t race with incoming reference updates that might leave the repository in a corrupt state. As those objects continue to age, they will be removed from the repository via subsequent git gc invocations. For (many) more details, see our post, Scaling Git’s garbage collection.

Even though the --prune=<date> mechanism of adding a grace period before permanently removing objects from the repository is relatively effective at avoiding corruption in practice, it is not completely fool-proof. And when we do encounter repository corruption, it is useful to have the missing objects close by to allow us to recover a corrupted repository.

In Git 2.39, git repack learned a new option to create an external copy of any objects removed from the repository: --expire-to. When combined with --cruft options like so:

$ git repack --cruft --cruft-expiration=5.minutes.ago -d --expire-to=../backup.git

any unreachable objects which haven’t been modified in the last five minutes are collected together and stored in a packfile that is written to ../backup.git. Then, objects you may be missing after garbage collection are readily available in the pack stored in ../backup.git.

These ideas are identical to the ones described in the “limbo repository” section of our Scaling Git’s garbage collection blog post. At the time of writing that post, those patches were still under review. Thanks to careful feedback from the Git community, the same tools that power GitHub’s own garbage collection are now available to you via Git 2.39.

On a related note, careful readers may have noticed that in order to write a cruft pack, you have to explicitly pass --cruft to both git gc and git repack. This is still the case. But in Git 2.39, users who enable the feature.experimental configuration and are running the bleeding edge of Git will now use cruft packs by default when running git gc.

[source, source]

If you’ve been following along with the gradual introduction of sparse index compatibility in Git commands, this one’s for you.

In previous versions of Git, using git grep --cached (to search through the index instead of the blobs in your working copy) you might have noticed that Git first has to expand your index when using the sparse index feature.

In large repositories where the sparse portion of the repository is significantly smaller than the repository as a whole, this adds a substantial delay before git grep --cached outputs any matches.

Thanks to the work of Google Summer of Code student, Shaoxuan Yuan, this is no longer the case. This can lead to some dramatic performance enhancements: when searching in a location within your sparse cone (for example., git grep --cached $pattern -- 'path/in/sparse/cone'), Git 2.39 outperforms the previous version by nearly 70%.

[source]

This one is a little bit technical, but bear with us, since it ends with a nifty performance optimization that may be coming to a Git server near you.

Before receiving a push, a Git server must first tell the pusher about all of the branches and tags it already knows about. This lets the client omit any objects that it knows the server already has, and results in less data being transferred overall.

Once the server has all of the new objects, it ensures that they are “connected” before entering them into the repository. Generally speaking, this “connectivity check” ensures that none of the new objects mention nonexistent objects; in other words, that the push will not corrupt the repository.

One additional factor worth noting is that some Git servers are configured to avoid advertising certain references. But those references are still used as part of the connectivity check. Taking into account the extra work necessary to incorporate those hidden references into the connectivity check, the additional runtime adds up, especially if there are a large number of hidden references.

In Git 2.39, the connectivity check was enhanced to only consider the references that were advertised, in addition to those that were pushed. In a test repository with nearly 7 million references (only ~3% of which are advertised), the resulting speed-up makes Git 2.39 outperform the previous version by roughly a factor of 4.5.

As your server operators upgrade to the latest version of Git, you should notice an improvement in how fast they are able to process incoming pushes.

[source]

Last but not least, let’s round out our recap of some of the highlights from Git 2.39 with a look at a handful of new security measures.

Git added two new “defense-in-depth” changes in the latest release. First, git apply was updated to refuse to apply patches larger than ~1 GiB in size to avoid potential integer overflows in the apply code. Git was also updated to correctly redact sensitive header information with GIT_TRACE_CURL=1 or GIT_CURL_VERBOSE=1 when using HTTP/2.

If you happen to notice a security vulnerability in Git, you can follow Git’s own documentation on how to responsibly report the issue. Most importantly, if you’ve ever been curious about how Git handles coordinating and disclosing embargoed releases, this release cycle saw a significant effort to codify and write down exactly how Git handles these types of issues.

To read more about Git’s disclosure policy (and learn about how to participate yourself!), you can find more in the repository.

[source, source, source]

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.39, or any previous version in the Git repository.

Notes

It’s true. In fact, the list at the bottom of the release announcement is generated by running git shortlog on the git log --no-merges between the last and current release. Calculating the number of new and existing contributors in each release is also powered by git shortlog. ↩
This is a bit of an oversimplification. In addition to storing the object modification times in an adjacent *.mtimes file, the cruft pack also contains unreachable objects that are reachable from anything modified within the last five minutes, regardless of its age. See the “mitigating object deletion raciness” section for more. ↩

Noise

Highlights from Git 2.39

The rest of the iceberg

Notes

The collective thoughts of the interwebz