范围内的整数是否存在 md5 冲突 (1,1000000000)

Question

我想使用范围在 1 到 999999999 之间的预先存在的唯一标识符的 md5 散列在 python 中生成一个 uuid

很明显，对于如此小的数字，没有任何担心......但这让我思考，我们是否知道具有相同 md5 哈希值的最小两个整数？

Answer 1

你可以测试一下：

provided you have ample memory, and/or an OS that manages excess memory usage with disk access, but then it will be rather slow - thanks to @EricDuminil & @vsenko in the comments

import hashlib

h = hashlib.new('md5', buffer=64)
md5hashes = set()

N = 1000000000

for _ in range(N):
    if _ % 10000000 == 0:
        print(".", flush = True)
    h.update(str(_).encode())
    hsh = h.hexdigest()
    if hsh in md5hashes:
        print(_, hsh)
    md5hashes.add(hsh)


print(f"has collisions: {len(md5hashes) != N})

Answer 2

您可以使用 MongoDb 作为数据存储来测试它：

import hashlib
import sys
import pymongo

def int_to_bytes(x):
    return x.to_bytes((x.bit_length() + 7) // 8, byteorder=sys.byteorder)

CLIENT = pymongo.MongoClient('mongodb://localhost:27017/')
DB = CLIENT['md5test']
COLLECTION = DB['vals']

COLLECTION.create_index('value', unique=True)

for x in range(1000000000):
    if x % 1000000 == 0:
        print(x)
    m = hashlib.md5()
    m.update(int_to_bytes(x))
    value = m.digest()
    COLLECTION.insert_one({
        'value': value
    })

print('done!')

它会在重复的情况下抛出。但是处理您的范围需要时间。

整数编码方式也很重要，请参阅 int_to_bytes()。

Answer 3

我测试过，没有。

用 set() 计数将在我的 64GB 内存盒上触发 MemoryError，所以我将 hexlify 哈希写入磁盘：

import hashlib
from multiprocessing import Pool
import os

def f(args):
    s, e = args
    l = []
    for i in xrange(s, e):
        h = hashlib.md5(str(i)).hexdigest()
        l.append(h)
    l.sort()
    fn = '/data/tmp/%s_%s' % (s, e)
    with open(fn, 'w') as f:
        for h in l:
            f.write('%s\n' % (h,))

def main():
    def gen():
        end = 1000000000
        step = 5000000
        s = 0
        while s < end:
            yield s, s+step
            s += step

    pool = Pool(processes=16)
    res = pool.imap_unordered(f, gen())
    list(res)

然后用 sort(1) 计数：

sort -mu /data/tmp/* | wc -l

产量：

1000000000

请注意，我将整数编码为 ASCII 字符串。

范围内的整数是否存在 md5 冲突 (1,1000000000)

Are there md5 collisions for integers in range(1,1000000000)

python

hash

uuid

md5

guid