How to troubleshoot an abnormally slow git-diff?

I recently cloned a remote repository in which some git commands run very slowly. For example, running

git diff --quiet

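...takes about 40 seconds. (For what it's worth, the repo is clean. I am using git version 2.20.1.)

In trying to figure out what is causing this slowness, I have come across a few procedures that get rid of it, though I have no idea why.

Of these procedures, the simplest/quickest one I have found is this: (starting from a freshly cloned instance of the repo) create a branch off master, and check it out. After this, if I check out master again, git diff --quiet completes quickly (under 50 ms).

Below is a sample interaction, showing timing information for the various operations1: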

rm -rf ./"$REPONAME"      #  0.174 s
git clone "$URL"          # 54.118 s
cd ./"$REPONAME"          #  0.007 s

git diff --quiet          # 39.438 s

git branch VOODOO         #  0.032 s
git checkout VOODOO       # 31.247 s
git diff --quiet          #  0.014 s

git checkout master       #  0.034 s
git diff --quiet          #  0.012 s

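As I have already stressed, this is only one of several possible procedures that "fix" the repo, and they are all equally mysterious to me. This one just happens to be the simplest/quickest one I have found.

The sequence of timings above is very reproducible (i.e., I get roughly the same timings every time I run that specific sequence exactly as shown).

It is, however, very sensitive to seemingly small changes. For example, if I replace git branch VOODOO; git checkout VOODOO with git checkout -b VOODOO, the subsequent timing profile changes radically: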

rm -rf ./"$REPONAME"      #  0.015 s
git clone "$URL"          # 45.312 s
cd ./"$REPONAME"          #  0.007 s

git diff --quiet          # 46.145 s

git checkout -b VOODOO    # 42.363 s
git diff --quiet          # 41.180 s

git checkout master       # 47.345 s
git diff --quiet          #  0.018 s

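I would like to figure out what is going on. How can I troubleshoot the problem further?

Is there a permanent ("committable") way to "fix" the repo? (By "fix" I mean: get rid of the long delays in git diff --quiet, git checkout ..., etc.)

(By the way, git gc does not fix the repo, not even temporarily; I tried it.)

I figure that what ultimately "fixes" the repo is that git gets around to building and caching some auxiliary data structures that allow it to perform some operations efficiently. If this hypothesis is correct, then my question could be rephrased as: what is the most direct way to cause git to build such auxiliary data structures?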


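EDIT: One additional bit of information that may shed light on the above is that this repository contains one unusually large (1GB) file. (This explains the slowness of the git clone step. I don't know whether it has anything to do with the slowness of git diff --quiet, and if so, how.)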


1 Needless to say, I named the branch VOODOO to reflect my ignorance of what is going on.

First, check whether the issue persists with Git 2.27, or with the upcoming 2.28 (Q3 2020).

I would use GIT_TRACE2_PERF for any performance measurement.
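As a rough illustration of that suggestion, the sketch below records a Trace2 performance log for git diff --quiet. It assumes Git 2.22 or later (Trace2 is not available in the 2.20.1 mentioned in the question), and it runs against a throwaway repository so that it is self-contained; the file name file.txt and the demo identity are placeholders. In practice you would cd into the slow clone instead.

```shell
#!/bin/sh
# Sketch: capture a Trace2 performance log for `git diff --quiet`.
# Assumption: Git >= 2.22. The repo below is a throwaway stand-in for the slow clone.
set -eu

repo=$(mktemp -d)        # temporary repository used only for this demo
trace_file=$(mktemp)     # Trace2 appends to this file; the path must be absolute

git init -q "$repo"
cd "$repo"
echo hello > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m init

# Setting GIT_TRACE2_PERF to an absolute path sends the perf-format trace there
# (setting it to 1 would print the trace to stderr instead).
diff_status=0
GIT_TRACE2_PERF="$trace_file" git diff --quiet || diff_status=$?

# Look at the elapsed-time columns to see where the time is spent.
head -n 20 "$trace_file"
```

On the real repository, comparing the trace of a slow run with the trace of a fast run (after the checkout "fix") should show which phase accounts for the difference.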

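With Git 2.28 (Q3 2020), the memory usage during "diff --quiet" in a worktree with too many stat-unmatched paths has been dramatically reduced.

Its patch description illustrates a use case in which "diff --quiet" can be slow:

See commit d2d7fbe (01 Jun 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 0cd0afc, 18 Jun 2020)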

diff: discard blob data from stat-unmatched pairs

Reported-by: Jan Christoph Uhde
Signed-off-by: Jeff King

When performing a tree-level diff against the working tree, we may find that our index stat information is dirty, so we queue a filepair to be examined later.
If the actual content hasn't changed, we call this a stat-unmatch; the stat information was out of date, but there's no actual diff.

Normally diffcore_std() would detect and remove these identical filepairs via diffcore_skip_stat_unmatch().

However, when "--quiet" is used, we want to stop the diff as soon as we see any changes, so we check for stat-unmatches immediately in diff_change().

That check may require us to actually load the file contents into the pair of diff_filespecs.
If we find that the pair isn't a stat-unmatch, then no big deal; we'd likely load the contents later anyway to generate a patch, do rename detection, etc, so we want to hold on to it.
But if it is a stat-unmatch, then we have no more use for that data; the whole point is that we're going discard the pair. However, we never free the allocated diff_filespec data.

In most cases, keeping that data isn't a problem. We don't expect a lot of stat-unmatch entries, and since we're using --quiet, we'd quit as soon as we saw such a real change anyway.

However, there are extreme cases where it makes a big difference:

  1. We'd generally mmap() the working tree half of the pair.
    And since the OS may limit the total number of maps, we can run afoul of this in large repositories. E.g.:

    $ cd linux
    $ git ls-files | wc -l
    67959
    $ sysctl vm.max_map_count
    vm.max_map_count = 65530
    $ git ls-files | xargs touch ;# everything is stat-dirty!
    $ git diff --quiet
    fatal: mmap failed: Cannot allocate memory
    

It should be unusual to have so many files stat-dirty, but it's possible if you've just run a script like "sed -i" or similar.

After this patch, the above correctly exits with code 0.

  2. Even if you don't hit mmap limits, the index half of the pair will have been pulled from the object database into heap memory.
    Again in a clone of linux.git, running:

    $ git ls-files | head -n 10000 | xargs touch
    $ git diff --quiet
    

peaks at 145MB heap before this patch, and 94MB after.

This patch solves the problem by freeing any diff_filespec data we picked up during the "--quiet" stat-unmatch check in diff_changes.
Nobody is going to need that data later, so there's no point holding on to it.
There are a few things to note:

  • we could skip queueing the pair entirely, which could in theory save a little work. But there's not much to save, as we need a diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway.
    And since we cache the result of the stat-unmatch checks, a later call to diffcore_skip_stat_unmatch() call will quickly skip over them.
    The diffcore code also counts up the number of stat-unmatched pairs as it removes them. It's doubtful any callers would care about that in combination with --quiet, but we'd have to reimplement the logic here to be on the safe side. So it's not really worth the trouble.

  • I didn't write a test, because we always produce the correct output unless we run up against system mmap limits, which are both unportable and expensive to test against. Measuring peak heap would be interesting, but our perf suite isn't yet capable of that.

  • note that diff without "--quiet" does not suffer from the same problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch entries and drop them immediately, so we're not carrying their data around.

  • you can still trigger the mmap limit problem if you truly have that many files with actual changes. But it's rather unlikely. The stat-unmatch check avoids loading the file contents if the size don't match, so you'd need a pretty trivial change in every single file.
    Likewise, inexact rename detection might load the data for many files all at once. But you'd need not just 64k changes, but that many deletions and additions. The most likely candidate is perhaps break-detection, which would load the data for all pairs and keep it around for the content-level diff. But again, you'd need 64k actually changed files in the first place.

So it's still possible to trigger this case, but it seems like "I accidentally made all my files stat-dirty" is the most likely case in the real world.
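The stat-unmatch situation described in the commit message can be reproduced at small scale. The sketch below makes a single file stat-dirty by changing only its timestamp, confirms that git diff --quiet still reports no change, and then re-syncs the index's cached stat information with git update-index --refresh. The throwaway repository, the file name, and the timestamp are placeholders. Whether a refresh like this also removes the delay in the question's repository is an assumption that is not verified here.

```shell
#!/bin/sh
# Sketch: make one file "stat-dirty" and then refresh the index.
# Everything here runs in a throwaway repository; names are placeholders.
set -eu

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo hello > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m init

# Change only the modification time, so the stat data cached in the index
# no longer matches the file on disk while the content stays identical.
touch -t 202001010000 file.txt

# Git has to compare the content to find out there is no real change,
# so this is a stat-unmatch and the command still exits with 0.
quiet_status=0
git diff --quiet || quiet_status=$?

# Rewrite the cached stat information so later commands can trust it again.
refresh_status=0
git update-index --refresh || refresh_status=$?

echo "diff --quiet: $quiet_status, update-index --refresh: $refresh_status"
```

With one small file the content comparison is instantaneous; the cost only becomes visible when many files, or one very large file, have to be re-read.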


With Git 2.30 (Q1 2021), "git diff" and other commands that share the same machinery to compare with working tree files have been taught to take advantage of the fsmonitor data when it is available.

See commit 2bfa953, commit 471b115, commit ed5a245, commit 89afd5f, commit 5851462, commit dc69d47 (20 Oct 2020) by Nipunn Koorapati (nipunn1313).
See commit c9052a8 (20 Oct 2020) by Alex Vandiver (alexmv).
(Merged by Junio C Hamano -- gitster -- in commit bf69da5, 09 Nov 2020)
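For context, the benchmark quoted below uses fsmonitor=.git/hooks/fsmonitor-watchman. A minimal sketch of how such a hook is typically enabled follows. It assumes that the Git templates installed the sample hook fsmonitor-watchman.sample and that Watchman is installed separately; neither is checked beyond the existence of the sample file, and the repository is a throwaway one.

```shell
#!/bin/sh
# Sketch: point core.fsmonitor at the Watchman sample hook.
# Assumptions: the sample hook exists in .git/hooks, and Watchman is installed.
set -eu

repo=$(mktemp -d)        # throwaway repository; use your own clone in practice
git init -q "$repo"
cd "$repo"

sample=.git/hooks/fsmonitor-watchman.sample
hook=.git/hooks/fsmonitor-watchman

if [ -f "$sample" ]; then
    cp "$sample" "$hook"
    chmod +x "$hook"
    # core.fsmonitor names the command Git asks for the list of changed files.
    git config core.fsmonitor "$hook"
else
    echo "sample hook not found; fsmonitor left disabled" >&2
fi

# Empty when the hook could not be installed.
fsmonitor_setting=$(git config --get core.fsmonitor || true)
echo "core.fsmonitor=$fsmonitor_setting"
```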

t/perf: add fsmonitor perf test for git diff

Signed-off-by: Nipunn Koorapati

Results for the git-diff fsmonitor optimization in patch in the parent-rev (using a 400k file repo to test)

As you can see here - git diff with fsmonitor running is significantly better with this patch series (80% faster on my workload)!

GIT_PERF_LARGE_REPO=~/src/server ./run v2.29.0-rc1 . -- p7519-fsmonitor.sh

Test                                                                     v2.29.0-rc1       this tree
-----------------------------------------------------------------------------------------------------------------
7519.2: status (fsmonitor=.git/hooks/fsmonitor-watchman)                 1.46(0.82+0.64)   1.47(0.83+0.62) +0.7%
7519.3: status -uno (fsmonitor=.git/hooks/fsmonitor-watchman)            0.16(0.12+0.04)   0.17(0.12+0.05) +6.3%
7519.4: status -uall (fsmonitor=.git/hooks/fsmonitor-watchman)           1.36(0.73+0.62)   1.37(0.76+0.60) +0.7%
7519.5: diff (fsmonitor=.git/hooks/fsmonitor-watchman)                   0.85(0.22+0.63)   0.14(0.10+0.05) -83.5%
7519.6: diff -- 0_files (fsmonitor=.git/hooks/fsmonitor-watchman)        0.12(0.08+0.05)   0.13(0.11+0.02) +8.3%
7519.7: diff -- 10_files (fsmonitor=.git/hooks/fsmonitor-watchman)       0.12(0.08+0.04)   0.13(0.09+0.04) +8.3%
7519.8: diff -- 100_files (fsmonitor=.git/hooks/fsmonitor-watchman)      0.12(0.07+0.05)   0.13(0.07+0.06) +8.3%
7519.9: diff -- 1000_files (fsmonitor=.git/hooks/fsmonitor-watchman)     0.12(0.09+0.04)   0.13(0.08+0.05) +8.3%
7519.10: diff -- 10000_files (fsmonitor=.git/hooks/fsmonitor-watchman)   0.14(0.09+0.05)   0.13(0.10+0.03) -7.1%
7519.12: status (fsmonitor=)                                             1.67(0.93+1.49)   1.67(0.99+1.42) +0.0%
7519.13: status -uno (fsmonitor=)                                        0.37(0.30+0.82)   0.37(0.33+0.79) +0.0%
7519.14: status -uall (fsmonitor=)                                       1.58(0.97+1.35)   1.57(0.86+1.45) -0.6%
7519.15: diff (fsmonitor=)                                               0.34(0.28+0.83)   0.34(0.27+0.83) +0.0%
7519.16: diff -- 0_files (fsmonitor=)                                    0.09(0.06+0.04)   0.09(0.08+0.02) +0.0%
7519.17: diff -- 10_files (fsmonitor=)                                   0.09(0.07+0.03)   0.09(0.06+0.05) +0.0%
7519.18: diff -- 100_files (fsmonitor=)                                  0.09(0.06+0.04)   0.09(0.06+0.04) +0.0%
7519.19: diff -- 1000_files (fsmonitor=)                                 0.09(0.06+0.04)   0.09(0.05+0.05) +0.0%
7519.20: diff -- 10000_files (fsmonitor=)                                0.10(0.08+0.04)   0.10(0.06+0.05) +0.0%

I also added a benchmark for a tiny git diff workload w/ a pathspec. I see an approximately .02 second overhead added w/ and w/o fsmonitor.

From looking at these results, I suspected that refresh_fsmonitor is already happening during git diff - independent of this patch series' optimization.
Confirmed that suspicion by breaking on refresh_fsmonitor.

(gdb) bt  [simplified]