Why does git log --find-object report two commits with different file contents for a given blob?

Question

I use git log --find-object to identify commits by supplying the file's git blob (the hash of the file content).

This usually works fine; I run git hash-object beforehand to get the file's blob hash.

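However, sometimes for a given blob hash of a file, git log --find-object=<blob> returns two commits for the same file, and the file contents in the returned commits are definitely different.

Getting multiple commits where the respective file content is identical is what I would expect, but reporting a commit whose content is not exactly the same seems strange to me (based on how I understand the --find-object option atm).

Why is that? Where do I need to refine the command?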

Answer 1

As the documentation states (also refer to the -S and -G options to understand it), with this option a commit is mentioned if the number of occurrences of said object changes.

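So if you take the blobid of a file in your repository (say, the blobid of file Readme.md), git log --find-object=<blobid> will:

1. report commits where this blobid appears as file Readme.md (this is what you expect);
2. report commits where the blob disappears as file Readme.md, e.g. commits that change the content of Readme.md from blobid to something else;
3. report commits where this blob appears or disappears at some other path, e.g. at some point a file doc/Doc.md was included with the exact same blobid;
4. not report commits where a file with that exact content is renamed, e.g. file doc/Doc.md is renamed to Readme.md, or Readme.md to doc/Doc.md.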

You can run:

git ls-tree -r <commit> | grep <blobid> # check parent commit too : git ls-tree -r <commit>^ | grep <blobid>

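to see which <commit> contains the blob, and at which path.

If you want to check what modified the exact path Readme.md, you can add it as a filter to git log:

git log --find-object=<blobid> -- Readme.md

This rules out cases 3 and 4 above.
You will still see commits where the content you are looking for is in the parent commit (case 2 above).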
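The cases above can be reproduced in a throwaway repository. The following is a minimal sketch, not part of the original answer: the file names `notes.txt`, `Readme.md` and `doc/Doc.md`, the commit messages, and the helper `g` (which passes a demo identity so the script does not depend on global git config) are all assumptions made for illustration.

```shell
#!/bin/sh
# Throwaway repository; every name below is illustrative only.
repo=$(mktemp -d)
git -C "$repo" init -q
g() { git -C "$repo" -c user.name=demo -c user.email=demo@example.com "$@"; }

echo start > "$repo/notes.txt"; g add notes.txt;  g commit -q -m 'initial'
echo one   > "$repo/Readme.md"; g add Readme.md;  g commit -q -m 'add Readme.md'
blobid=$(g rev-parse HEAD:Readme.md)

# Case 3: the same blob appears at another path.
mkdir "$repo/doc"; cp "$repo/Readme.md" "$repo/doc/Doc.md"
g add doc/Doc.md; g commit -q -m 'add doc/Doc.md with same content'

# Case 2: the blob disappears from Readme.md.
echo two > "$repo/Readme.md"; g add Readme.md; g commit -q -m 'change Readme.md'

# Without a path filter, all three commits change the occurrence count.
all=$(g log --format=%s --find-object="$blobid")
# With a path filter, only changes at Readme.md are considered.
limited=$(g log --format=%s --find-object="$blobid" -- Readme.md)

echo "--- all paths:";      echo "$all"
echo "--- Readme.md only:"; echo "$limited"

# For each reported commit, show where the blob sits in the commit and in its parent.
for c in $(g log --format=%h --find-object="$blobid"); do
    echo "== $c"
    g ls-tree -r "$c"  | grep "$blobid" | sed 's/^/  commit: /'
    g ls-tree -r "$c^" | grep "$blobid" | sed 's/^/  parent: /'
done
```

The commit that changes Readme.md stays in the filtered output even though its own content differs from the blob, which is the situation described in the question.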

Answer 2

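This is a bit more precise than the answer above, although that one covers the most common cases. What git log --find-object does is find commits where, going from parent to child, the commit changes the number of occurrences of that particular blob.

For example, suppose we create a new empty repository with an initial commit containing a README file: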

$ mkdir tlog
$ cd tlog
$ git init
Initialized empty Git repository in [path]
$ echo test find-object stuff > README
$ git add README
$ git commit -m initial
[master (root-commit) 2177143] initial
 1 file changed, 1 insertion(+)
 create mode 100644 README

Now let's create a blob, commit it, and observe its hash ID:

$ echo file content > afile
$ git add afile
$ git commit -m 'add some content'
[master 45c4e39] add some content
 1 file changed, 1 insertion(+)
 create mode 100644 afile
$ git rev-parse HEAD:afile
dd59d098638313f5d00a7fa657379b33b191f2e2
$ blobid=$(git rev-parse HEAD:afile)

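Now let's make a commit that does not change the number of files having that blob hash ID, by adding a file with different content, and then add a third file with the same content as the first file, and hence the same blob hash ID: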

$ echo different > bfile
$ git add bfile && git commit -m 'add different content'
[master c5a5306] add different content
 1 file changed, 1 insertion(+)
 create mode 100644 bfile
$ cp afile cfile && git add cfile 
$ git commit -m 're-add same content as afile, ie, same blob id'
[master 20c97e5] re-add same content as afile, ie, same blob id
 1 file changed, 1 insertion(+)
 create mode 100644 cfile
$ git rev-parse HEAD:cfile
dd59d098638313f5d00a7fa657379b33b191f2e2

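As you can see, the same hash ID shows up again. (In fact, any repository that contains a file matching my afile or cfile contains that blob hash ID! The commits will have unique hash IDs, but any file that reads file content plus a single newline will have the blob hash ID dd59d098638313f5d00a7fa657379b33b191f2e2.)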
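The parenthetical claim can be checked without any repository. The sketch below assumes the default SHA-1 object format and a `sha1sum` binary (coreutils); the helper name `blob_id` is hypothetical. It hashes the header `blob <size>` followed by a NUL byte and then the file content, which is what git hash-object computes for a blob.

```shell
#!/bin/sh
# Compute a git blob ID by hand: SHA-1 over "blob <size>\0<content>".
# Assumes a SHA-1 repository format and that sha1sum is installed.
blob_id() {
    # $1: path to a file; prints the 40-hex-digit blob ID.
    size=$(wc -c < "$1" | tr -d ' ')
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

tmp=$(mktemp)
echo file content > "$tmp"      # same content as afile: the text plus one newline
manual=$(blob_id "$tmp")
echo "$manual"
rm -f "$tmp"
```

Because the ID depends only on the size and the bytes of the content, neither the file name nor the repository plays any part in it.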

Now let's look at the output of git log --oneline and git log --oneline --find-object=$blobid:

$ git log --oneline
20c97e5 (HEAD -> master) re-add same content as afile, ie, same blob id
c5a5306 add different content
45c4e39 add some content
2177143 initial
$ git log --oneline --find-object=$blobid
20c97e5 (HEAD -> master) re-add same content as afile, ie, same blob id
45c4e39 add some content

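We see commit 45c4e39 in both cases because comparing 2177143 initial with 45c4e39 add some content shows that the number of files with that object hash has gone up from zero to one. We see 20c97e5 because comparing that commit to its parent c5a5306 shows that the number of files has gone up from 1 to 2. If we remove one copy, the count will change again, and we will see that commit. If we remove both copies, the count will change (to zero) and we will see that commit.

In other words, what we see is every commit in which the count of blob objects with the given hash ID changes.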
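The counting rule can be made visible by printing the number of occurrences of the blob in every commit. This is a sketch that rebuilds the example history in a temporary directory; the commit hashes will differ from the transcript above, and the helper `g` and the shortened commit messages are assumptions for illustration.

```shell
#!/bin/sh
# Rebuild the example history in a throwaway repository.
repo=$(mktemp -d)
git -C "$repo" init -q
g() { git -C "$repo" -c user.name=demo -c user.email=demo@example.com "$@"; }

echo test find-object stuff > "$repo/README"; g add README; g commit -q -m 'initial'
echo file content > "$repo/afile"; g add afile; g commit -q -m 'add some content'
echo different > "$repo/bfile"; g add bfile; g commit -q -m 'add different content'
cp "$repo/afile" "$repo/cfile"; g add cfile; g commit -q -m 're-add same content'
blobid=$(g rev-parse HEAD:afile)

# Count how many tree entries in each commit point at the blob (newest first).
counts=""
for c in $(g rev-list HEAD); do
    n=$(g ls-tree -r "$c" | grep -c "$blobid")
    counts="${counts:+$counts }$n"
    echo "$n  $(g log -1 --format=%s "$c")"
done

# Commits whose count differs from their parent's are the ones reported.
reported=$(g log --format=%s --find-object="$blobid")
echo "--- reported by --find-object:"
echo "$reported"
```

The commit that adds bfile keeps the count at one, so it is the only non-initial commit missing from the --find-object output.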

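There is a bug of sorts in this git log option: it depends on each commit having a single parent. If we have a merge commit, that is, a commit with two or more parents, Git has to compare the blob hash IDs in the merge with those in both parents. Perhaps the count changes in one comparison but not in the other. What should Git do with this? Git's current answer is that it falls down completely here (hence the "bug of sorts"), but with a fix that is in the queue you will get something better, though still imperfect, because there is no obvious right answer for this case. (The bug is that Git goes through a special code path in git log that is meant to handle history simplification, which is the wrong thing to do. The proposed fix makes Git go through a more appropriate path, so that you will at least see the merge if there is some change in the count, which is clearly much better. But that still leaves other cases, for other options, that do not always work correctly. Git needs a general solution for diffs across merges, and that requires a framework that does not currently exist.)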

Answer 3

Note that the result of this command may change with Git 2.29 (Q4 2020): "git log -c --find-object=X" did not work well to find a merge that involves a change to an object X from only one parent.

See commit 957876f (30 Sep 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 7da656f, 05 Oct 2020)

combine-diff: handle --find-object in multitree code path

Signed-off-by: Jeff King

When doing combined diffs, we have two possible code paths:

a slower one which independently diffs against each parent, applies any filters, and then intersects the resulting paths

a faster one which walks all trees simultaneously

When the diff options specify that we must do certain filters, like pickaxe, then we always use the slow path, since the pickaxe code only knows how to handle filepairs, not the n-parent entries generated for combined diffs.

But there are two problems with the slow path:

1. It's slow. Running:

git rev-list HEAD | git diff-tree --stdin -r -c

in git.git takes ~3s on my machine.
But adding "--find-object" to that increases it to ~6s, even though find-object itself should incur only a few extra oid comparisons.
On linux.git, it's even worse: 35s versus 215s.

2. It doesn't catch all cases where a particular path is interesting.
Consider a merge with parent blobs X and Y for a particular path, and end result Z. That should be interesting according to "-c", because the result doesn't match either parent. And it should be interesting even with "--find-object=X", because "X" went away in the merge.

But because we perform each pairwise diff independently, this confuses the intersection code. The change from X to Z is still interesting according to --find-object. But in the other parent we went from Y to Z, so the diff appears empty! That causes the intersection code to think that parent didn't change the path, and thus it's not interesting for "-c".

This patch fixes both by implementing --find-object for the multitree code.

It's a bit unfortunate that we have to duplicate some logic from diffcore-pickaxe, but this is the best we can do for now. In an ideal world, all of the diffcore code would stop thinking about filepairs and start thinking about n-parent sets, and we could use the multitree walk with all of it.

Until then, there are some leftover warts:

other pickaxe operations, like -S or -G, still suffer from both problems.
These would be hard to adapt because they rely on having a diff_filespec() for each path to look at content. And we'd need to define what an n-way "change" means in each case (probably easy for "-S", which can compare counts, but not so clear for -G, which is about grepping diffs).

other options besides --find-object may cause us to use the slow pairwise path, in which case we'll go back to producing a different (wrong) answer for the X/Y/Z case above.

We may be able to hack around these, but I think the ultimate solution will be a larger rewrite of the diffcore code.
For now, this patch improves one specific case but leaves the rest.
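The X/Y/Z merge described in the commit message can be set up in a throwaway repository to try this out. The sketch below is an illustration under assumptions: the branch names `left` and `right`, the file name `path`, and the helper `g` are made up, and the output of the final command depends on the installed Git version (per the commit message, Git 2.29 or later should list the merge, while older versions may not), so no expected output is shown.

```shell
#!/bin/sh
# Build a merge where one parent has blob X, the other has Y, and the result is Z.
repo=$(mktemp -d)
git -C "$repo" init -q
g() { git -C "$repo" -c user.name=demo -c user.email=demo@example.com "$@"; }

echo base > "$repo/other"; g add other; g commit -q -m 'base'
base=$(g rev-parse HEAD)

g checkout -q -b left
echo X > "$repo/path"; g add path; g commit -q -m 'left: path = X'
blobx=$(g rev-parse HEAD:path)

g checkout -q -b right "$base"
echo Y > "$repo/path"; g add path; g commit -q -m 'right: path = Y'

# Both sides added "path" with different content, so the merge stops with a conflict.
g merge -q left >/dev/null 2>&1 || true
# Resolve the conflict with a third content, Z, and conclude the merge.
echo Z > "$repo/path"; g add path; g commit -q -m 'merge: path = Z'
nparents=$(g rev-list --parents -n 1 HEAD | wc -w | tr -d ' ')

# Ask for combined diffs (-c) so that merges are examined at all.
g log --oneline -c --find-object="$blobx"
```

Running the last command once with `-c` and once without it shows whether the installed version reports the merge as a commit in which X went away.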
