Getting the hash (digest) of a file in Python: reading the whole file at once vs. reading line by line

Question

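I need to get the hash (digest) of a file in Python.

Normally, when processing file contents, it is advisable to work through the file gradually, line by line, to avoid memory problems. But I need to load the whole file in order to obtain its digest.

Currently I obtain the hash like this: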

import hashlib

def get_hash(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        data = file.read()
    h.update(data)
    digest = h.hexdigest()
    return digest

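Is there a more optimized or cleaner way to do this?

Given that the whole file still has to be read to compute the hash, is there any significant improvement in reading it line by line rather than all at once?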

Answer 1

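Of course, you can load the data in chunks, so that memory usage drops significantly because you no longer have to hold the whole file in memory. Then call hash.update(chunk) for each chunk: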

import hashlib
from functools import partial

hasher = hashlib.new("sha1")
size = 128  # just an example

with open("data.txt", "rb") as f:
    # iter() calls f.read(size) repeatedly until it returns b'' (end of file)
    for chunk in iter(partial(f.read, size), b''):
        hasher.update(chunk)

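I find this iter trick very neat because it allows for much cleaner code. It may look confusing at first, so I'll explain how it works:

iter(function, sentinel) calls function repeatedly and yields the values it returns until one of them equals sentinel.
partial(f.read, size) returns a callable that, when called with no arguments, calls f.read(size). This is an oversimplification, but it is accurate in this case.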
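
The same idea can be wrapped in a reusable function. This is only a minimal sketch: the name hash_file_in_chunks, the 64 KiB default chunk size, and the temporary demo file are assumptions made for illustration, not part of the answer above.

```python
import hashlib
import os
import tempfile
from functools import partial


def hash_file_in_chunks(path, algorithm="sha1", chunk_size=64 * 1024):
    """Return the hex digest of the file at `path`, reading chunk_size bytes at a time."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # partial(f.read, chunk_size) is a zero-argument callable;
        # iter() keeps calling it until it returns the sentinel b"" (end of file).
        for chunk in iter(partial(f.read, chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demo on a throwaway file (hypothetical data) so the sketch runs anywhere.
data = b"example payload\n" * 10_000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    demo_path = tmp.name

try:
    chunked_digest = hash_file_in_chunks(demo_path)
    whole_digest = hashlib.sha1(data).hexdigest()
    print(chunked_digest == whole_digest)
finally:
    os.remove(demo_path)
```

Only one chunk is held in memory at a time, so peak memory use depends on chunk_size rather than on the file size.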

Answer 2

Both snippets give the same result:

h = hashlib.new("md5")
with open(filename,"rb") as f:
    for line in f:
        h.update(line)
print(h.hexdigest())

and

h = hashlib.new("md5")
with open(filename,"rb") as f:
    h.update(f.read())

print(h.hexdigest())

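Some notes:

The first approach is the most memory-friendly for large text files. For binary files there is no such thing as a "line". It will still work, but a "chunk" approach is more conventional (I won't paraphrase the other answers here).
The second approach uses a lot of memory if the file is large.
In both cases, make sure you open the file in binary mode; otherwise end-of-line conversion can produce a wrong checksum (external tools would compute a different MD5 than your program).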
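
To illustrate the first note, here is a rough sketch of what line iteration does on data that contains no newline byte. The helper names and the payload are hypothetical. The digest still matches, but the whole file arrives as a single "line", so the memory benefit disappears.

```python
import hashlib
import os
import tempfile


def md5_by_lines(path):
    """Hash a file by iterating over it line by line in binary mode."""
    h = hashlib.new("md5")
    with open(path, "rb") as f:
        for line in f:
            h.update(line)
    return h.hexdigest()


def longest_line(path):
    """Return the length in bytes of the longest 'line' the iterator yields."""
    with open(path, "rb") as f:
        return max((len(line) for line in f), default=0)


# Hypothetical "binary" payload: bytes 1..9 repeated, so it never contains b"\n" (byte 10).
payload = bytes(range(1, 10)) * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    binary_path = tmp.name

try:
    line_digest = md5_by_lines(binary_path)
    longest = longest_line(binary_path)
finally:
    os.remove(binary_path)

print(line_digest == hashlib.new("md5", payload).hexdigest())
print(longest == len(payload))  # the entire file was read as one line
```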

Answer 3

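According to the hashlib documentation for update(), you don't need to worry about the block size of the different hash algorithms. Still, I tested it, and it checks out. MD5's internal block size is 512 bits (64 bytes); the code below reads 512 bytes at a time, and changing that to any other size gives the same result as reading the whole file at once.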

import hashlib

def get_hash(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        data = file.read()
    h.update(data)
    digest = h.hexdigest()
    return digest

def get_hash_memory_optimized(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        block = file.read(512)
        while block:
            h.update(block)
            block = file.read(512)

    return h.hexdigest()

digest = get_hash('large_bin_file')
print(digest)

digest = get_hash_memory_optimized('large_bin_file')
print(digest)

> bcf32baa9b05ca3573bf568964f34164
> bcf32baa9b05ca3573bf568964f34164
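
A quick way to check the claim about read sizes is to hash the same file with several of them and compare the results. This is a sketch only: digest_with_block_size, the random payload, and the chosen sizes are assumptions for the experiment, not part of the answer above.

```python
import hashlib
import os
import tempfile


def digest_with_block_size(path, block_size, mode="md5"):
    """Hash the file at `path`, reading block_size bytes per call."""
    h = hashlib.new(mode)
    with open(path, "rb") as f:
        block = f.read(block_size)
        while block:
            h.update(block)
            block = f.read(block_size)
    return h.hexdigest()


# Hypothetical test data: 100,000 random bytes written to a temporary file.
payload = os.urandom(100_000)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    sample_path = tmp.name

try:
    expected = hashlib.new("md5", payload).hexdigest()
    digests = {
        size: digest_with_block_size(sample_path, size)
        for size in (1, 64, 512, 4096, 1_000_000)
    }
finally:
    os.remove(sample_path)

# Every read size should give the same digest as hashing the data in one call.
print(all(d == expected for d in digests.values()))
# The algorithm's internal block size, in bytes.
print(hashlib.new("md5").block_size)
```

The read size affects speed and memory use, not the resulting digest, because repeated update() calls are equivalent to a single call on the concatenated data.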
