在有很多提交的存储库上加速 `git blame`

Speed up `git blame` on repository with many commits

我正在尝试 git blame 以下文件(运行 在我的本地机器上),因为它太慢而无法生成 GitHub 的指责:

https://github.com/Homebrew/homebrew-core/blob/master/Formula/sqlite.rb

但是在本地 运行 也很慢,在我的机器上超过一分钟

time git --no-pager blame Formula/sqlite.rb > /dev/null

存储库包含超过 150K 次提交。

有没有办法加快 git blame 命令的速度?

按照 Git 标准,自制软件核心存储库相当大。一个 250 MB 的存储库,150,000 次提交用于 4000 "Formulas"。这会影响性能。 Git集线器确实有问题。

git blame Formula/sqlite.rb 在装有 Git 2.22.0 的 2018 i7 Macbook 上大约需要 45 秒。慢 Git 标准,但考虑到运行频率 git blame.

是可以接受的

作为此存储库的用户,没有什么可做的。 git blame 必须向后搜索每个提交以查看哪些更改了此文件。不幸的是 git blame 似乎没有利用并行处理。

有一些选项...

  1. 联系 Github 了解问题,希望他们能解决。
  2. 限制您查看历史的时间:git blame --since=1.year -- Formula/sqlite.rb
  3. 重新考虑此回购中需要快速 git blame 的任何过程。
  4. 缓存结果。

在 Git 2.27(2020 年第 2 季度)中,“git blame”学会利用“changed-pathsBloom filter stored in the commit-graph file, and

参见 commit 1b4c57f, commit 24b7d1e, commit fe88f9f (23 Apr 2020) by Jeff King (peff)
参见 commit 0906ac2, commit b23ea97, commit 8918e37 (16 Apr 2020) by Derrick Stolee (derrickstolee)
(由 Junio C Hamano -- gitster -- in commit 6d56d4c 合并,2020 年 5 月 1 日)

blame: use changed-path Bloom filters

Signed-off-by: Derrick Stolee

The changed-path Bloom filters help reduce the amount of tree parsing required during history queries.

Before calculating a diff, we can ask the filter if a path changed between a commit and its first parent.

  • If the filter says "no" then we can move on without parsing trees.
  • If the filter says "maybe" then we parse trees to discover if the answer is actually "yes" or "no".

When computing a blame, there is a section in find_origin() that computes a diff between a commit and one of its parents.
When this is the first parent, we can check the Bloom filters before calling diff_tree_oid().

In order to make this work with the blame machinery, we need to initialize a struct bloom_key with the initial path. But also, we need to add more keys to a list if a rename is detected. We then check to see if any of these keys answer "maybe" in the diff.

If a user requests copy detection using "git blame -C", then there are more places where the set of "important" files can expand. I do not know enough about how this happens in the blame machinery.
Thus, the Bloom filter integration is explicitly disabled in this mode.
A later change could expand the bloom_key data with an appropriate call (or calls) to add_bloom_key().

Generally, this is a performance enhancement and should not change the behavior of 'git blame' in any way.
If a repo has a commit-graph file with computed changed-path Bloom filters, then they should notice improved performance for their 'git blame' commands.

Here are some example timings that I found by blaming some paths in the Linux kernel repository:

I specifically looked for "deep" paths that were also edited many times.
As a counterpoint, the MAINTAINERS file was edited many times but is located in the root tree.
This means that the cost of computing a diff relative to the pathspec is very small. Here are the timings for that command:

These timings are the best of five.
The worst-case runs were on the order of 2.5 minutes for both cases.
Note that the MAINTAINERS file has 18,740 lines across 17,000+ commits. This happens to be one of the cases where this change provides the least improvement.

The lack of improvement for the MAINTAINERS file and the relatively modest improvement for the other examples can be easily explained.
The blame machinery needs to compute line-level diffs to determine which lines were changed by each commit. That makes up a large proportion of the computation time, and this change does not attempt to improve on that section of the algorithm.
The MAINTAINERS file is large and changed often, so it takes time to determine which lines were updated by which commit. In contrast, the code files are much smaller, and it takes longer to compute the line-by-line diff for a single patch on the Linux mailing lists.

Outside of the "-C" integration, I believe there is little more to gain from the changed-path Bloom filters for 'git blame' after this patch.


确保使用 Git 2.29(2020 年第 4 季度),因为有一个小错误:

参见 commit 1302bad (08 Sep 2020) by Edmundo Carmona Antoranz (eantoranz)
(由 Junio C Hamano -- gitster -- in commit e1dd499 合并,2020 年 9 月 18 日)

blame.c: replace instance of !oidcmp for oideq

Signed-off-by: Edmundo Carmona Antoranz

0906ac2b ("blame: use changed-path Bloom filters", 2020-04-16, Git v2.27.0-rc0 -- merge listed in batch #6) introduced a call to oidcmp() that should have been oideq(), which was introduced in 14438c44 ("introduce hasheq() and oideq()", 2018-08-28, Git v2.20.0-rc0 -- merge listed in batch #1).


使用 Git 2.29(2020 年第 4 季度),“git commit-graph(man) write”学会了限制使用 [=53= 从头开始​​计算的布隆过滤器的数量]选项。

这将有益于 git blame

参见 commit d356d5d, commit 98bb796, commit 59f0d50, commit 97ffa4f (17 Sep 2020), commit 809e032 (18 Sep 2020), commit 9a7a9ed, commit 312cff5 (16 Sep 2020), and commit b66d847, commit 24f951a, commit ab14d06, commit 025d529, commit 4f36440 (09 Sep 2020) by Taylor Blau (ttaylorr)
参见 commit b16a827 (16 Sep 2020) by Derrick Stolee (derrickstolee)
(由 Junio C Hamano -- gitster -- in commit 288ed98 合并,2020 年 9 月 29 日)

builtin/commit-graph.c: introduce '--max-new-filters='

Helped-by: Junio C Hamano
Signed-off-by: Taylor Blau

Introduce a command-line flag to specify the maximum number of new Bloom filters that a 'git commit-graph write'(man) is willing to compute from scratch.

Prior to this patch, a commit-graph write with '--changed-paths' would compute Bloom filters for all selected commits which haven't already been computed (i.e., by a previous commit-graph write with '--split' such that a roll-up or replacement is performed).

This behavior can cause prohibitively-long commit-graph writes for a variety of reasons:

  • There may be lots of filters whose diffs take a long time to generate (for example, they have close to the maximum number of changes, diffing itself takes a long time, etc).
  • Old-style commit-graphs (which encode filters with too many entries as not having been computed at all) cause us to waste time recomputing filters that appear to have not been computed only to discover that they are too-large.

This can make the upper-bound of the time it takes for 'git commit-graph write --changed-paths'(man) to be rather unpredictable.

To make this command behave more predictably, introduce '--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters from scratch.
This lets "computing" already-known filters proceed quickly, while bounding the number of slow tasks that Git is willing to do.

git commit-graph 现在包含在其 man page 中:

With the --max-new-filters=<n> option, generate at most n new Bloom filters (if --changed-paths is specified).
If n is -1, no limit is enforced.
Only commits present in the new layer count against this limit.
To retroactively compute Bloom filters over earlier layers, it is advised to use --split=replace.


使用 Git 2.31(2021 年第一季度),优化“git blame"(man)

参见 commit 8e16eff (17 Feb 2021) by Rafael Silva (raffs)
(由 Junio C Hamano -- gitster -- in commit 18decfd 合并,2021 年 2 月 25 日)

blame: remove unnecessary use of get_commit_info()

Signed-off-by: Rafael Silva
Reviewed-by: Taylor Blau

When git blame(man) --color-by-age, the determine_line_heat() is called to select how to color the output based on the commit's author date.
It uses the get_commit_info() to parse the information into a commit_info structure, however, this is actually unnecessary because the determine_line_heat() caller also does the same.

Instead, let's change the determine_line_heat() to take a commit_info structure and remove the internal call to get_commit_info() thus cleaning up and optimizing the code path.

Enabling Git's trace2 API in order to record the execution time for every call to determine_line_heat() function:

+ trace2_region_enter("blame", "determine_line_heat", the_repository);
  determine_line_heat(ent, &default_color);
+ trace2_region_enter("blame", "determine_line_heat", the_repository);

Then, running git blame for "kernel/fork.c" in linux.git and summing all the execution time for every call (around 1.3k calls) resulted in 2.6x faster execution (best out 3):

git built from 328c109303 (The eighth batch, 2021-02-12) = 42ms
git built from 328c109303 + this change                  = 16ms