CRC 校验和的分配

Question

我正在研究 CRC 校验和用作散列时的冲突概率。我知道如何计算均匀分布的哈希算法的碰撞概率（这意味着获得随机输入数据的所有可能校验和的机会是相同的）。

我不知道的（我在网上找不到）：

CRC 校验和通常[不]均匀分布吗？
分布是否取决于多项式？
分布是否取决于输入数据大小？

P.S.: 我知道使用 CRC 作为散列时的限制，所以这不是这个问题的一部分。

Answer 1

除了恶意（您可以通过更改消息中的位来强制执行您喜欢的任何 CRC）之外，CRC 均匀分布在所有值上。多项式无所谓，只要是有效的CRC多项式即可，输入只需为CRC的大小或更大即可。

Answer 2

我也很好奇这个，所以我在Linux上使用crc32命令做了一些测试：

# I am printing each number several times so the data is longer than 32-bits:
$ for N in {000001..999999}; do echo -n $N$N$N$N | crc32 /dev/stdin; done >crcs


# There are no complete (8-character) collisions:
$ cat crcs | sort | uniq -d | wc -l
0

# There are no 7-character collisions:
$ for COL in 1 2; do cat crcs | awk "{print substr($1,$COL,7)}" | sort | uniq -d; done | wc -l
0

# There are exactly 32k 6-character collisions:
$ for COL in 1 2 3; do cat crcs | awk "{print substr($1,$COL,6)}" | sort | uniq -d; done | wc -l
32768


# Also, the distribution of the letters in each column is *extremely* uniform.
# Each column has results similar to these:
$ cat crcs | awk '{print substr(,1,1)}' | sort | uniq -c
  62440 0
  62439 1
  62440 2
  62440 3
  62560 4
  62560 5
  62560 6
  62560 7
  62560 8
  62560 9
  62560 a
  62560 b
  62440 c
  62440 d
  62440 e
  62440 f

...所以我的结论是 CRC32 在均匀分布校验和方面做得非常好。

CRC 校验和的分配

Distribution of CRC checksums

hash

distribution

crc

hash-collision