How to hash a big file without having to manually process chunks of data?
When we want to get the hash of a big file in Python using Python's hashlib, we can process chunks of data of size 1024 bytes like this:
import hashlib
m = hashlib.md5()
chunksize = 1024
with open("large.txt", 'rb') as f:
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        m.update(chunk)
print(m.hexdigest())
Or simply ignore the chunking altogether, like this:
import hashlib
sha256 = hashlib.sha256()
with open("large.txt", 'rb') as f:
    sha256.update(f.read())  # reads the entire file into memory at once
print(sha256.hexdigest())
Finding the best implementation can be tricky and calls for some performance testing and tuning (1024-byte chunks? 4 KB? 64 KB? etc.), as discussed in Hashing file in Python 3? or Getting a hash string of a very large file.
Question: is there a cross-platform, ready-to-use function to compute the MD5 or SHA256 of a big file in Python? (so that we don't have to reinvent the wheel, or worry about the optimal chunk size, etc.)
Something like:
import hashlib
# get the result without having to think about chunks, etc.
hashlib.file_sha256('bigfile.txt')
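For what it's worth, Python 3.11 added hashlib.file_digest, which comes close to this wished-for API: it takes an open binary file object rather than a path, and does the chunked reading internally:

import hashlib

# hashlib.file_digest is available since Python 3.11.
with open('bigfile.txt', 'rb') as f:
    digest = hashlib.file_digest(f, 'sha256')
print(digest.hexdigest())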
Are you sure you actually need to optimize this? I did some profiling, and on my computer there isn't much to gain once the chunk size is no longer absurdly small:
import os
import timeit

filename = "large.txt"
with open(filename, 'w') as f:
    f.write('x' * 100*1000*1000)  # Create 100 MB file

setup = '''
import hashlib
def md5(filename, chunksize):
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        while chunk := f.read(chunksize):
            m.update(chunk)
    return m.hexdigest()
'''

for i in range(16):
    chunksize = 32 * 2**i
    print('chunksize:', chunksize)
    print(timeit.Timer(f'md5("{filename}", {chunksize})', setup=setup).repeat(2, 2))

os.remove(filename)
It prints:
chunksize: 32
[1.3256129720248282, 1.2988303459715098]
chunksize: 64
[0.7864588440279476, 0.7887071970035322]
chunksize: 128
[0.5426529520191252, 0.5496777250082232]
chunksize: 256
[0.43311091500800103, 0.43472746800398454]
chunksize: 512
[0.36928231100318953, 0.37598425400210544]
chunksize: 1024
[0.34912850096588954, 0.35173907200805843]
chunksize: 2048
[0.33507052797358483, 0.33372197503922507]
chunksize: 4096
[0.3222631579847075, 0.3201586640207097]
chunksize: 8192
[0.33291386102791876, 0.31049903703387827]
chunksize: 16384
[0.3095061599742621, 0.3061956529854797]
chunksize: 32768
[0.3073280190001242, 0.30928074003895745]
chunksize: 65536
[0.30916607001563534, 0.3033451830269769]
chunksize: 131072
[0.3083479679771699, 0.3039141249610111]
chunksize: 262144
[0.3087183449533768, 0.30319386802148074]
chunksize: 524288
[0.29915712698129937, 0.29429047100711614]
chunksize: 1048576
[0.2932401319849305, 0.28639856696827337]
This shows that you can just pick a large but not crazy chunk size, e.g. 1 MB.
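Concretely, that means a helper like the md5 function from the setup string above, with a 1 MiB default baked in (a small sketch; the md5sum name is mine):

import hashlib

def md5sum(filename, chunksize=2**20):
    # 1 MiB chunks: large but not crazy, per the timings above.
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        while chunk := f.read(chunksize):
            m.update(chunk)
    return m.hexdigest()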
I created a package, simple-file-checksum, for your use case. It just uses subprocess to call openssl on macOS/Linux and CertUtil on Windows, and extracts only the digest from the output.
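Conceptually the approach looks roughly like this (a minimal sketch, not the package's actual source; the output parsing reflects my assumptions about the two tools' formats):

import platform
import subprocess

def sha256_via_cli(path):
    if platform.system() == 'Windows':
        # CertUtil prints a header line, then the hex digest, then a footer;
        # older Windows versions separate the hex bytes with spaces.
        out = subprocess.run(['certutil', '-hashfile', path, 'SHA256'],
                             capture_output=True, text=True, check=True).stdout
        return out.splitlines()[1].strip().replace(' ', '')
    # openssl prints e.g. "SHA2-256(path)= <hex>" (or "SHA256(path)=" on 1.x).
    out = subprocess.run(['openssl', 'dgst', '-sha256', path],
                         capture_output=True, text=True, check=True).stdout
    return out.rsplit('= ', 1)[1].strip()

print(sha256_via_cli('bigfile.txt'))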
Simple File Checksum
Returns the MD5, SHA1, SHA256, SHA384, or SHA512 checksum of a file.
Installation
Run the following to install:
pip3 install simple-file-checksum
Usage
Python:
>>> from simple_file_checksum import get_checksum
>>> get_checksum("tst/file.txt")
'9e107d9d372bb6826bd81d3542a419d6'
>>> get_checksum("tst/file.txt", algorithm="MD5")
'9e107d9d372bb6826bd81d3542a419d6'
>>> get_checksum("tst/file.txt", algorithm="SHA1")
'2fd4e1c67a2d28fced849ee1bb76e7391b93eb12'
>>> get_checksum("tst/file.txt", algorithm="SHA256")
'd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592'
>>> get_checksum("tst/file.txt", algorithm="SHA384")
'ca737f1014a48f4c0b6dd43cb177b0afd9e5169367544c494011e3317dbf9a509cb1e5dc1e85a941bbee3d7f2afbc9b1'
>>> get_checksum("tst/file.txt", algorithm="SHA512")
'07e547d9586f6a73f73fbac0435ed76951218fb7d0c8d788a309d785436bbb642e93a252a954f23912547d1e8a3b5ed6e1bfd7097821233fa0538f3db854fee6'
Terminal:
$ simple-file-checksum tst/file.txt
9e107d9d372bb6826bd81d3542a419d6
$ simple-file-checksum tst/file.txt -a MD5
9e107d9d372bb6826bd81d3542a419d6
$ simple-file-checksum tst/file.txt -a SHA1
2fd4e1c67a2d28fced849ee1bb76e7391b93eb12
$ simple-file-checksum tst/file.txt -a SHA256
d7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592
$ simple-file-checksum tst/file.txt -a SHA384
ca737f1014a48f4c0b6dd43cb177b0afd9e5169367544c494011e3317dbf9a509cb1e5dc1e85a941bbee3d7f2afbc9b1
$ simple-file-checksum tst/file.txt -a SHA512
07e547d9586f6a73f73fbac0435ed76951218fb7d0c8d788a309d785436bbb642e93a252a954f23912547d1e8a3b5ed6e1bfd7097821233fa0538f3db854fee6