`git diff --patience` 和 `git diff --histogram` 有什么区别？

Question

This earlier question asked for the differences between 4 different Git diff strategies, but the only difference that was explained was the difference between myers and patience, which is pretty well explained elsewhere.

histogram 策略如何运作？它与 patience 有什么区别？ git-diff man page 只说它 "extends the patience algorithm to " 支持低频公共元素"。其他页面提到它更快，并且它来自 JGit，但它们没有解释 它的算法或结果在何处或如何与 patience.

在哪里可以找到 histogram 算法相对于 patience 算法 的描述，其详细程度与 Bram Cohen's original description of the patience algorithm?

（如果这只是实现性能的问题，没有任何情况会产生不同的结果，为什么不将其作为 patience 的新后端实现？）

Answer 1

此直方图策略是在 git 1.7.7 (Sept 2011) 中引入的，具有以下描述（如 OP 所述）

"git diff" learned a "--histogram" option to use a different diff generation machinery stolen from jgit, which might give better performance.

JGit 包括 src/org/eclipse/jgit/diff/HistogramDiff.java and tst/org/eclipse/jgit/diff/HistogramDiffTest.java

那里的描述比较完整：

HistogramDiff

An extended form of Bram Cohen's patience diff algorithm.

This implementation was derived by using the 4 rules that are outlined in Bram Cohen's blog, and then was further extended to support low-occurrence common elements.

The basic idea of the algorithm is to create a histogram of occurrences for each element of sequence A. Each element of sequence B is then considered in turn. If the element also exists in sequence A, and has a lower occurrence count, the positions are considered as a candidate for the longest common subsequence (LCS).
After scanning of B is complete the LCS that has the lowest number of occurrences is chosen as a split point. The region is split around the LCS, and the algorithm is recursively applied to the sections before and after the LCS.

By always selecting a LCS position with the lowest occurrence count, this algorithm behaves exactly like Bram Cohen's patience diff whenever there is a unique common element available between the two sequences.
When no unique elements exist, the lowest occurrence element is chosen instead.
This offers more readable diffs than simply falling back on the standard Myers' O(ND) algorithm would produce.

To prevent the algorithm from having an O(N^2) running time, an upper limit on the number of unique elements in a histogram bucket is configured by #setMaxChainLength(int).
If sequence A has more than this many elements that hash into the same hash bucket, the algorithm passes the region to #setFallbackAlgorithm(DiffAlgorithm).
If no fallback algorithm is configured, the region is emitted as a replace edit.

During scanning of sequence B, any element of A that occurs more than #setMaxChainLength(int) times is never considered for an LCS match position, even if it is common between the two sequences. This limits the number of locations in sequence A that must be considered to find the LCS,and helps maintain a lower running time bound.

So long as #setMaxChainLength(int) is a small constant (such as 64), the algorithm runs in O(N * D) time, where N is the sum of the input lengths and D is the number of edits in the resulting EditList.
If the supplied SequenceComparator has a good hash function, this implementation typically out-performs MyersDiff, even though its theoretical running time is the same.

This implementation has an internal limitation that prevents it from handling sequences with more than 268,435,456 (2^28) elements

请注意，这种算法是 already used for pack_check, back in 2006 (git 1.3), for git-verify-pack -v. It was reused for index-pack in git 1.7.7

Commit 8c912ee实际上引入了--histogram diff:

Port JGit's HistogramDiff algorithm over to C. Rough numbers (TODO) show that it is faster than its --patience cousin, as well as the default Meyers algorithm.

The implementation has been reworked to use structs and pointers, instead of bitmasks, thus doing away with JGit's 2^28 line limit.

We also use xdiff's default hash table implementation (xdl_hash_bits() with XDL_HASHLONG()) for convenience.

commit 8555123 (git 1.7.10, April 2012) 添加：

8c912ee (teach --histogram to diff, 2011-07-12) claimed histogram diff was faster than both Myers and patience.

We have since incorporated a performance testing framework, so add a test that compares the various diff tasks performed in a real 'log -p' workload.
This does indeed show that histogram diff slightly beats Myers, while patience is much slower than the others.

最后，commit 07ab4de (git 1.8.2, March 2013)添加

config: Introduce diff.algorithm variable

Some users or projects prefer different algorithms over others, e.g. patience over myers or similar.
However, specifying appropriate argument every time diff is to be used is impractical. Moreover, creating an alias doesn't play nicely with other tools based on diff (git-show for instance).

Hence, a configuration variable which is able to set specific algorithm is needed.
For now, these four values are accepted:

'myers' (which has the same effect as not setting the config variable at all),

'minimal',

'patience' and

'histogram'.

Commit 07924d4 同时添加了 --diff-algorithm 命令行选项。
作为 OP Stuart P. Bentley mentions :

you can configure Git to use histogram by default with:
git config --global diff.algorithm histogram

更新：Git2.12（2017 年第 1 季度）将淘汰在某些极端情况下具有灾难性性能问题的“快速哈希”。

参见commit 1f7c926 (01 Dec 2016) by Jeff King (peff)。 ^{（由 Junio C Hamano -- gitster -- in commit 731490b 合并，2016 年 12 月 19 日）}

xdiff: drop XDL_FAST_HASH

The xdiff code hashes every line of both sides of a diff, and then compares those hashes to find duplicates. The overall performance depends both on how fast we can compute the hashes, but also on how many hash collisions we see.

The idea of XDL_FAST_HASH is to speed up the hash computation.
But the generated hashes have worse collision behavior. This means that in some cases it speeds diffs up (running "git log -p" on git.git improves by ~8% with it), but in others it can slow things down. One pathological case saw over a 100x slowdown.

There may be a better hash function that covers both properties, but in the meantime we are better off with the original hash. It's slightly slower in the common case, but it has fewer surprising pathological cases.

注意：“git diff --histogram”有一个错误的内存使用模式，它有重新安排以减少高峰使用，Git 2.19（2018 年第三季度）。

参见 commit 79cb2eb, commit 64c4e8b, commit c671d4b, commit 2820985 (19 Jul 2018) by Stefan Beller (stefanbeller)。
^{（由 Junio C Hamano -- gitster -- in commit 57fbd8e 合并，2018 年 8 月 15 日）}

xdiff/xhistogram: move index allocation into find_lcs

This fixes a memory issue when recursing a lot, which can be reproduced as
seq 1   100000 >one
seq 1 4 100000 >two
git diff --no-index --histogram one two
Before this patch, histogram_diff would call itself recursively before calling free_index, which would mean a lot of memory is allocated during the recursion and only freed afterwards.

By moving the memory allocation (and its free call) into find_lcs, the memory is free'd before we recurse, such that memory is reused in the next step of the recursion instead of using new memory.

This addresses only the memory pressure, not the run time complexity, that is also awful for the corner case outlined above.

注意：耐心和直方图算法有内存泄漏，已通过 Git 2.36（2022 年第二季度）

修复

参见 commit 43ad3af, commit 4a37b80, commit 61f8839, commit 9df0fc3 (16 Feb 2022) by Phillip Wood (phillipwood)。
^{（由 Junio C Hamano -- gitster -- in commit 47be28e 合并，2022 年 3 月 9 日）}

xdiff: fix a memory leak

^{Reported-by: Junio C Hamano}
^{Signed-off-by: Phillip Wood}

Although the patience and histogram algorithms initialize the environment they do not free it if there is an error.
In contrast for the Myers algorithm the environment is initalized in xdl_do_diff() and it is freed if there is an error.
Fix this by always initializing the environment in xdl_do_diff() and freeing it there if there is an error.

`git diff --patience` 和 `git diff --histogram` 有什么区别？

What's the difference between `git diff --patience` and `git diff --histogram`?

git

algorithm

diff

git-diff

HistogramDiff

An extended form of Bram Cohen's patience diff algorithm.

`config`: Introduce `diff.algorithm` variable

`xdiff`: drop `XDL_FAST_HASH`

`xdiff/xhistogram`: move index allocation into `find_lcs`

`xdiff`: fix a memory leak

`git diff --patience` 和 `git diff --histogram` 有什么区别？

What's the difference between `git diff --patience` and `git diff --histogram`?

git

algorithm

diff

git-diff

HistogramDiff

An extended form of Bram Cohen's patience diff algorithm.

config: Introduce diff.algorithm variable

xdiff: drop XDL_FAST_HASH

xdiff/xhistogram: move index allocation into find_lcs

xdiff: fix a memory leak

`config`: Introduce `diff.algorithm` variable

`xdiff`: drop `XDL_FAST_HASH`

`xdiff/xhistogram`: move index allocation into `find_lcs`

`xdiff`: fix a memory leak