Why does Git use a cryptographic hash function?

Why does Git use SHA-1, a cryptographic hash function, instead of a faster non-cryptographic hash function?

Related question:

The Stack Overflow question "Why does Git use SHA-1 as version numbers?" asks why Git identifies commits by SHA-1 rather than by sequential numbers.

TL;DR:

  • From 2005 to 2018 (Git 2.18): SHA-1 (see below)
  • Since then: a transition to SHA-256

You can check the reasoning from Linus Torvalds himself, when he presented Git to Google back in 2007 (emphasis mine):

We check checksums that is considered cryptographically secure. Nobody has been able to break SHA-1, but the point is, SHA-1 as far as git is concerned, isn't even a security feature. It's purely a consistency check.
The security parts are elsewhere. A lot of people assume since git uses SHA-1 and SHA-1 is used for cryptographically secure stuff, they think that it's a huge security feature. It has nothing at all to do with security, it's just the best hash you can get.

Having a good hash is good for being able to trust your data, it happens to have some other good features, too, it means when we hash objects, we know the hash is well distributed and we do not have to worry about certain distribution issues.

Internally it means from the implementation standpoint, we can trust that the hash is so good that we can use hashing algorithms and know there are no bad cases.

So there are some reasons to like the cryptographic side too, but it's really about the ability to trust your data.
I guarantee you, if you put your data in git, you can trust the fact that five years later, after it is converted from your harddisc to DVD to whatever new technology and you copied it along, five years later you can verify the data you get back out is the exact same data you put in. And that is something you really should look for in a source code management system.
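The "consistency check" described in the quote can be sketched in a few lines. This is a minimal illustration, not Git's own code: it assumes the classic SHA-1 object format, where a blob's ID is the SHA-1 of the header "blob <size>\0" followed by the file content. The helper names git_blob_id and verify_blob are hypothetical.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID of a blob with this content.

    Assumes the SHA-1 object format: header "blob <size>\\0" + content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def verify_blob(content: bytes, expected_id: str) -> bool:
    """Check that data read back still matches the ID it was stored under."""
    return git_blob_id(content) == expected_id


# The empty blob has a well-known ID in every SHA-1 repository.
EMPTY_BLOB_ID = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
print(git_blob_id(b"") == EMPTY_BLOB_ID)   # True
print(verify_blob(b"x", EMPTY_BLOB_ID))    # False: any changed byte is detected
```

Because the ID is derived from the content, copying a repository from one medium to another and rehashing it is enough to confirm that nothing was silently altered.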


Update December 2017, with Git 2.16 (Q1 2018): work to support an alternative SHA is in progress: see "".


I mentioned in "How would git handle a SHA-1 collision on a blob?" that you could engineer a commit with a particular SHA1 prefix (still an extremely costly endeavor).
But the point remains, as Eric Sink mentions in "Git: Cryptographic Hashes" (from the book Version Control by Example, 2011):

It is rather important that the DVCS never encounter two different pieces of data which have the same digest. Fortunately, good cryptographic hash functions are designed to make such collisions extremely unlikely.

It is hard to find a good non-cryptographic hash with a low collision rate, unless you consider research like "Finding State-of-the-Art Non-cryptographic Hashes with Genetic Programming".
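To get a feel for why collisions matter, here is a rough sketch of a birthday search against a 32-bit checksum. It uses zlib.crc32 from the Python standard library purely as a stand-in for "a short, fast, non-cryptographic digest"; it is not one of the hashes discussed above, and the function name find_crc32_collision is hypothetical. What the demo mostly shows is the effect of digest width: any 32-bit digest, however well designed, is expected to collide after a number of random inputs on the order of 2^16.

```python
import random
import zlib


def find_crc32_collision(seed: int = 0, max_tries: int = 2_000_000):
    """Birthday search: return (tries, a, b) with a != b and crc32(a) == crc32(b).

    Returns None if no collision is found within max_tries inputs.
    """
    rng = random.Random(seed)
    seen = {}  # checksum -> first input that produced it
    for tries in range(1, max_tries + 1):
        data = rng.getrandbits(128).to_bytes(16, "big")
        digest = zlib.crc32(data)
        other = seen.get(digest)
        if other is not None and other != data:
            return tries, other, data
        seen[digest] = data
    return None


result = find_crc32_collision()
if result is not None:
    tries, a, b = result
    print(f"collision after {tries} inputs: {a.hex()} vs {b.hex()}")
```

A content-addressed store that treated these two inputs as the same object would silently lose one of them, which is the situation the quote above says a DVCS must never encounter.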

You can also read "Consider use of non-cryptographic hash algorithm for hashing speed-up", which mentions for instance "xxhash", an extremely fast non-cryptographic hash algorithm that works at speeds close to RAM limits.


Discussions about changing the hash in Git are not new:

(Linus Torvalds)

There's not really anything remaining of the mozilla code, but hey, I started from it. In retrospect I probably should have started from the PPC asm code that already did the blocking sanely - but that's a "20/20 hindsight" kind of thing.

Plus hey, the mozilla code being a horrid pile of crud was why I was so convinced that I could improve on things. So that's a kind of source for it, even if it's more about the motivational side than any actual remaining code ;)

And you need to be careful about how to measure the actual optimization gain:

(Linus Torvalds)

I pretty much can guarantee you that it improves things only because it makes gcc generate crap code, which then hides some of the P4 issues.

(John Tapsell - johnflux)

The engineering cost for upgrading git from SHA-1 to a new algorithm is much higher. I'm not sure how it can be done well.

First of all we probably need to deploy a version of git (let's call it version 2 for this conversation) which allows there to be a slot for a new hash value even though it doesn't read or use that space -- it just uses the SHA-1 hash value which is in the other slot.

That way once we eventually deploy yet a newer version of git, let's call it version 3, which produces SHA-3 hashes in addition to SHA-1 hashes, people using git version 2 will be able to continue to inter-operate.
(Although, per this discussion, they may be vulnerable and people who rely on their SHA-1-only patches may be vulnerable.)

In short, switching to any other hash is not easy.


Update February 2017: yes, it is in theory possible to compute a SHA1 collision: shattered.io

How is GIT affected?

GIT strongly relies on SHA-1 for the identification and integrity checking of all file objects and commits.
It is essentially possible to create two GIT repositories with the same head commit hash and different contents, say a benign source code and a backdoored one.
An attacker could potentially selectively serve either repository to targeted users. This will require attackers to compute their own collision.

But:

This attack required over 9,223,372,036,854,775,808 SHA1 computations. This took the equivalent processing power as 6,500 years of single-CPU computations and 110 years of single-GPU computations.
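The figure in that quote is easier to judge as a power of two. The quick check below is only arithmetic on the quoted number; the 2^80 comparison is an assumption added here (the generic birthday bound for a 160-bit digest), not something taken from the quote.

```python
# Number of SHA-1 computations quoted above.
SHATTERED_COMPUTATIONS = 9_223_372_036_854_775_808

# The quoted figure is exactly 2**63.
print(SHATTERED_COMPUTATIONS == 2 ** 63)  # True

# Assumption: a generic birthday attack on a 160-bit digest
# needs on the order of 2**(160 / 2) = 2**80 computations.
BRUTE_FORCE_COMPUTATIONS = 2 ** (160 // 2)

speedup = BRUTE_FORCE_COMPUTATIONS // SHATTERED_COMPUTATIONS
print(speedup)  # 131072, i.e. 2**17
```

So the attack is roughly a hundred thousand times cheaper than brute force, yet still far beyond what a casual attacker can afford.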

So let's not panic just yet.
See more at "How would Git handle a SHA-1 collision on a blob?".