What is _md5.md5 and why is hashlib.md5 so much slower?
I found this undocumented _md5 module when getting frustrated with the slow stdlib hashlib.md5 implementation.
On a MacBook:
>>> timeit hashlib.md5(b"hello world")
597 ns ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"hello world")
224 ns ± 3.18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> _md5
<module '_md5' from '/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload/_md5.cpython-37m-darwin.so'>
On a Windows box:
>>> timeit hashlib.md5(b"stonk overflow")
328 ns ± 21.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"stonk overflow")
110 ns ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> _md5
<module '_md5' (built-in)>
On a Linux box:
>>> timeit hashlib.md5(b"https://adventofcode.com/2016/day/5")
259 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"https://adventofcode.com/2016/day/5")
102 ns ± 0.0576 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> _md5
<module '_md5' from '/usr/local/lib/python3.8/lib-dynload/_md5.cpython-38-x86_64-linux-gnu.so'>
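For hashing short messages, it's way faster. For long messages, the performance is similar.
Why is it hidden away in an underscore extension module, and why isn't this faster implementation used by default in hashlib? What is the _md5 module and why doesn't it have a public API?
It's common for public Python modules to delegate their methods to a hidden module.
For example, the full code of the collections.abc module is: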
from _collections_abc import *
from _collections_abc import __all__
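A quick way to see that delegation for yourself is to compare object identity. This is only a small sketch, and it relies on the private _collections_abc module, which is a CPython implementation detail and may change:

```python
import collections.abc
import _collections_abc  # private module; CPython implementation detail

# The public name and the private name refer to the very same class object,
# because collections.abc only re-exports what _collections_abc defines.
same_object = collections.abc.Mapping is _collections_abc.Mapping
print(same_object)
```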
The functions of hashlib are dynamically created:
for __func_name in __always_supported:
    # try them all, some may not work due to the OpenSSL
    # version not supporting that algorithm.
    try:
        globals()[__func_name] = __get_hash(__func_name)
    except ValueError:
        import logging
        logging.exception('code for hash %s was not found.', __func_name)
The definition of __always_supported is:
__always_supported = ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512',
                      'blake2b', 'blake2s',
                      'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512',
                      'shake_128', 'shake_256')
And __get_hash is either __get_openssl_constructor or __get_builtin_constructor:
try:
    import _hashlib
    new = __hash_new
    __get_hash = __get_openssl_constructor
    algorithms_available = algorithms_available.union(
            _hashlib.openssl_md_meth_names)
except ImportError:
    new = __py_new
    __get_hash = __get_builtin_constructor
__get_builtin_constructor is a fallback for the (again) hidden _hashlib module:
def __get_openssl_constructor(name):
    if name in __block_openssl_constructor:
        # Prefer our blake2 and sha3 implementation.
        return __get_builtin_constructor(name)
    try:
        f = getattr(_hashlib, 'openssl_' + name)
        # Allow the C module to raise ValueError.  The function will be
        # defined but the hash not actually available thanks to OpenSSL.
        f()
        # Use the C function directly (very fast)
        return f
    except (AttributeError, ValueError):
        return __get_builtin_constructor(name)
Above that in the hashlib code, you have this:
def __get_builtin_constructor(name):
    cache = __builtin_constructor_cache
    ...
    elif name in {'MD5', 'md5'}:
        import _md5
        cache['MD5'] = cache['md5'] = _md5.md5
But md5 is not in __block_openssl_constructor, hence the _hashlib/openssl version is preferred to the _md5/builtin version.
Confirmation in the REPL:
>>> hashlib.md5
<built-in function openssl_md5>
>>> _md5.md5
<built-in function md5>
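The same check can be scripted. The sketch below is only an illustration: md5_backends is a helper name I made up, and both _hashlib and _md5 are private CPython modules that may be missing on some builds, hence the guarded imports. It lists whichever constructors are present and hashes the same input with each of them:

```python
import hashlib


def md5_backends():
    """Collect whichever md5 constructors this interpreter exposes."""
    found = {"hashlib.md5": hashlib.md5}
    try:
        import _hashlib  # private OpenSSL wrapper (CPython detail)
        found["_hashlib.openssl_md5"] = _hashlib.openssl_md5
    except (ImportError, AttributeError):
        pass  # built without OpenSSL, or md5 not exported by it
    try:
        import _md5  # private builtin implementation (CPython detail)
        found["_md5.md5"] = _md5.md5
    except ImportError:
        pass
    return found


backends = md5_backends()
digests = {}
for label, ctor in backends.items():
    print(label, "->", ctor)
    # Every backend implements the same algorithm, so the digests must agree.
    digests[label] = ctor(b"A").hexdigest()
```

Whatever the printed constructors look like on your build, the digests should be identical, since only the implementation differs and not the algorithm.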
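Those functions are different implementations of the MD5 algorithm, and openssl_md5 makes a call into a dynamic system library. That's why you see some performance differences. The first version is defined in https://github.com/python/cpython/blob/master/Modules/_hashopenssl.c and the other in https://github.com/python/cpython/blob/master/Modules/md5module.c, if you want to check the differences.
Then why is the _md5.md5 function defined but never used? I guess the idea is to ensure that some algorithms are always available, even if openssl is absent: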
Constructors for hash algorithms that are always present in this module are sha1(), sha224(), sha256(), sha384(), sha512(), blake2b(), and blake2s(). (https://docs.python.org/3/library/hashlib.html)
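My theory, from looking around bugs.python.org and reading the cpython git commit history:
cpython switched to openssl md5 in 2005 because it was faster than the builtin implementation. In 2007 a new builtin implementation was added that was faster than openssl, but hashlib never switched back. Both of these changes were made by Gregory P. Smith.
Here's my evidence.
- In 2005, Greg created the bpo issue "sha and md5 modules should use OpenSSL when possible". This change was made in this commit.
- In 2007, Greg added a new fast md5 module in this commit.
The _md5 implementation appears to be basically the same in Python 3.8 (I'm looking at commit ea316fd21527).
I think the cpython maintainers might be open to switching back to _md5 when it's available, since it's no longer true that the openssl implementation is faster (and it probably hasn't been true for the past 13 years).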
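Until Python 2.5, hashes and digests were implemented in their own modules (e.g. [Python 2.Docs]: md5 - MD5 message digest algorithm).
Starting with v2.5, [Python 2.6.Docs]: hashlib - Secure hashes and message digests was added. Its purpose was to:
1. Offer a unified access method to the hashes / digests (via their names)
2. Switch (by default) to an external cryptography provider (delegating to some entity specialized in that field seems the logical step, as maintaining all those algorithms could be overkill). At that time OpenSSL was the best choice: mature enough, well known, and compatible (there were a bunch of similar Java providers, but those were of no use here)
As a side effect of #2., the Python implementations were hidden from the public API (renamed to _md5, _sha1, _sha256, _sha512, with _blake2 and _sha3 added later), as redundancy often creates confusion.
But another side effect was the dependency of _hashlib.so on OpenSSL's libcrypto*.so (this is *nix (at least Linux) specific; on Windows, a static libeay32.lib was linked into _hashlib.pyd, and also into _ssl.pyd (which I consider lame), until v3.7+, where the OpenSSL .dlls are part of the Python installation).
Probably on 90+% of the machines things went smoothly, as OpenSSL was / is installed by default. On those where it isn't, many things might break, because for example hashlib is imported by many modules (one such example is random, which itself gets imported by lots of others), so trivial pieces of code that aren't related to cryptography at all (at least not at first sight) would cease working. That's why the old implementations are kept (but again, they are only fallbacks, as the OpenSSL versions are / should be better maintained).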
[cfati@cfati-ubtu16x64-0:~/Work/Dev/Whosebug/q059955854]> ~/sopr.sh
*** Set shorter prompt to better fit when pasted in Whosebug (or other) pages ***
[064bit-prompt]> python3 -c "import sys, hashlib as hl, _md5, ssl;print(\"{0:}\n{1:}\n{2:}\n{3:}\".format(sys.version, _md5, hl._hashlib, ssl.OPENSSL_VERSION))"
3.5.2 (default, Oct 8 2019, 13:06:37)
[GCC 5.4.0 20160609]
<module '_md5' (built-in)>
<module '_hashlib' from '/usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so'>
OpenSSL 1.0.2g 1 Mar 2016
[064bit-prompt]>
[064bit-prompt]> ldd /usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so
linux-vdso.so.1 => (0x00007fffa7d0b000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f50d9e4d000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f50d9a83000)
libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007f50d963e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f50da271000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f50d943a000)
[064bit-prompt]>
[064bit-prompt]> openssl version -a
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
platform: debian-amd64
options: bn(64,64) rc4(16x,int) des(idx,cisc,16,int) blowfish(idx)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM
OPENSSLDIR: "/usr/lib/ssl"
[064bit-prompt]>
[064bit-prompt]> python3 -c "import _md5, hashlib as hl;print(_md5.md5(b\"A\").hexdigest(), hl.md5(b\"A\").hexdigest())"
7fc56270e7a70fa81a5935b72eacbe29 7fc56270e7a70fa81a5935b72eacbe29
According to [Python 3.Docs]: hashlib.algorithms_guaranteed:
A set containing the names of the hash algorithms guaranteed to be supported by this module on all platforms. Note that ‘md5’ is in this list despite some upstream vendors offering an odd “FIPS compliant” Python build that excludes it.
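To see how that plays out on a given interpreter, the guaranteed set can be compared with what the linked OpenSSL additionally provides. A minimal sketch using only the documented hashlib attributes (the variable names are mine):

```python
import hashlib

# Algorithms hashlib promises on every platform.
guaranteed = hashlib.algorithms_guaranteed
# Everything usable in this interpreter (guaranteed ones plus OpenSSL extras).
available = hashlib.algorithms_available

md5_guaranteed = "md5" in guaranteed
extras = sorted(available - guaranteed)

print("md5 guaranteed:", md5_guaranteed)
print("only via OpenSSL:", extras)
```

The second line of output depends entirely on the OpenSSL build in use, so it will differ from machine to machine.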
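Below is an example from a custom Python 2.7 installation (which I built quite a while ago; it's worth mentioning that it links dynamically to the OpenSSL .dlls):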
e:\Work\Dev\Whosebug\q059955854>sopr.bat
*** Set shorter prompt to better fit when pasted in Whosebug (or other) pages ***
[prompt]> "F:\Install\pc064\HPE\OPSWpython.7.10__00\python.exe" -c "import sys, ssl;print(\"{0:}\n{1:}\".format(sys.version, ssl.OPENSSL_VERSION))"
2.7.10 (default, Mar 8 2016, 15:02:46) [MSC v.1600 64 bit (AMD64)]
OpenSSL 1.0.2j-fips 26 Sep 2016
[prompt]> "F:\Install\pc064\HPE\OPSWpython.7.10__00\python.exe" -c "import hashlib as hl;print(hl.md5(\"A\").hexdigest())"
7fc56270e7a70fa81a5935b72eacbe29
[prompt]> "F:\Install\pc064\HPE\OPSWpython.7.10__00\python.exe" -c "import ssl;ssl.FIPS_mode_set(True);import hashlib as hl;print(hl.md5(\"A\").hexdigest())"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ValueError: error:060A80A3:digital envelope routines:FIPS_DIGESTINIT:disabled for fips
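On Python 3.9 and later, the hashlib constructors accept a usedforsecurity keyword that is meant for exactly this kind of restricted environment. The sketch below assumes a modern interpreter (it does not apply to the 2.7 build shown above), and md5_hexdigest_non_security is a name I made up. Whether the flag actually lifts the restriction depends on how Python and OpenSSL were built, which I haven't verified on a FIPS system:

```python
import hashlib


def md5_hexdigest_non_security(data: bytes) -> str:
    """Hash data with MD5 for non-security purposes (e.g. checksums)."""
    try:
        # The usedforsecurity keyword was added in Python 3.9.
        h = hashlib.md5(data, usedforsecurity=False)
    except TypeError:
        # Older interpreters do not accept the keyword.
        h = hashlib.md5(data)
    return h.hexdigest()


digest = md5_hexdigest_non_security(b"A")
print(digest)
```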
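Regarding the speed question, I can only speculate:
- The Python implementation was (obviously) written specifically for Python, meaning it is "more optimized" (yes, this is grammatically incorrect) for Python than a generic version, and it also resides in python*.so (or the python executable itself)
- The OpenSSL implementation resides in libcrypto*.so, and it is accessed through the wrapper _hashlib.so, which converts back and forth between the Python types (PyObject*) and the OpenSSL ones (EVP_MD_CTX*)
Considering the above, it would make sense for the former to be (slightly) faster, at least for small messages, where the overhead (function calls and other underlying Python operations) takes up a significant share of the total time compared to the hashing itself. There are also other factors to consider (e.g. whether the OpenSSL assembler speedups were used).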
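One rough way to probe the fixed-overhead idea is to time both constructors on an empty message, a tiny one and a larger one. This is only a sketch: per_call_ns is a made-up helper, _md5 is a private module that may be absent (hence the guarded import), and the lambda adds the same small call overhead to both measurements:

```python
import hashlib
import timeit

try:
    from _md5 import md5 as md5_builtin  # private module, may be missing
except ImportError:
    md5_builtin = None


def per_call_ns(func, payload, number=20_000, repeat=3):
    """Best-of-`repeat` time per call, in nanoseconds."""
    best = min(timeit.repeat(lambda: func(payload), number=number, repeat=repeat))
    return best / number * 1e9


results = {}
for size in (0, 16, 4096):
    payload = b"A" * size
    row = {"hashlib.md5": per_call_ns(hashlib.md5, payload)}
    if md5_builtin is not None:
        row["_md5.md5"] = per_call_ns(md5_builtin, payload)
    results[size] = row
    print(size, row)
```

If the speculation holds, the gap at 0 and 16 bytes should be roughly the same (pure call overhead), while at 4096 bytes the hashing itself dominates.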
Update #0
Below are some benchmarks of my own.
code00.py:
#!/usr/bin/env python

import sys
from hashlib import md5 as md5_openssl
from _md5 import md5 as md5_builtin
import timeit


def main(*argv):
    base_text = b"A"
    number = 1000000
    print("timeit attempts number: {0:d}".format(number))
    #x = []
    #y = {}
    for count in range(0, 16):
        factor = 2 ** count
        text = base_text * factor
        globals_dict = {"text": text}
        #x.append(factor)
        print("\nUsing a {0:8d} (2 ** {1:2d}) bytes message".format(len(text), count))
        for func in [
            md5_openssl,
            md5_builtin,
        ]:
            globals_dict["md5"] = func
            t = timeit.timeit(stmt="md5(text)", globals=globals_dict, number=number)
            print("    {0:12s} took: {1:11.6f} seconds".format(func.__name__, t))
            #y.setdefault(func.__name__, []).append(t)
    #print(x, y)


if __name__ == "__main__":
    print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    main(*sys.argv[1:])
    print("\nDone.")
Output:
Win 10 pc064 (running on a Dell Precision 5510 laptop):
[prompt]> "e:\Work\Dev\VEnvs\py_pc064_03.07.06_test0\Scripts\python.exe" code00.py
Python 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)] 64bit on win32
timeit attempts number: 1000000
Using a 1 (2 ** 0) bytes message
openssl_md5 took: 0.449134 seconds
md5 took: 0.120021 seconds
Using a 2 (2 ** 1) bytes message
openssl_md5 took: 0.460399 seconds
md5 took: 0.118555 seconds
Using a 4 (2 ** 2) bytes message
openssl_md5 took: 0.451850 seconds
md5 took: 0.121166 seconds
Using a 8 (2 ** 3) bytes message
openssl_md5 took: 0.438398 seconds
md5 took: 0.118127 seconds
Using a 16 (2 ** 4) bytes message
openssl_md5 took: 0.454653 seconds
md5 took: 0.122818 seconds
Using a 32 (2 ** 5) bytes message
openssl_md5 took: 0.450776 seconds
md5 took: 0.118594 seconds
Using a 64 (2 ** 6) bytes message
openssl_md5 took: 0.555761 seconds
md5 took: 0.278812 seconds
Using a 128 (2 ** 7) bytes message
openssl_md5 took: 0.681296 seconds
md5 took: 0.455921 seconds
Using a 256 (2 ** 8) bytes message
openssl_md5 took: 0.895952 seconds
md5 took: 0.807457 seconds
Using a 512 (2 ** 9) bytes message
openssl_md5 took: 1.401584 seconds
md5 took: 1.499279 seconds
Using a 1024 (2 ** 10) bytes message
openssl_md5 took: 2.360966 seconds
md5 took: 2.878650 seconds
Using a 2048 (2 ** 11) bytes message
openssl_md5 took: 4.383245 seconds
md5 took: 5.655477 seconds
Using a 4096 (2 ** 12) bytes message
openssl_md5 took: 8.264774 seconds
md5 took: 10.920909 seconds
Using a 8192 (2 ** 13) bytes message
openssl_md5 took: 15.521947 seconds
md5 took: 21.895179 seconds
Using a 16384 (2 ** 14) bytes message
openssl_md5 took: 29.947287 seconds
md5 took: 43.198639 seconds
Using a 32768 (2 ** 15) bytes message
openssl_md5 took: 59.123447 seconds
md5 took: 86.453821 seconds
Done.
Ubtu 16 pc064 (VM running in VirtualBox on the above machine):
[064bit-prompt]> python3 code00.py
Python 3.5.2 (default, Oct 8 2019, 13:06:37) [GCC 5.4.0 20160609] 64bit on linux
timeit attempts number: 1000000
Using a 1 (2 ** 0) bytes message
openssl_md5 took: 0.246166 seconds
md5 took: 0.130589 seconds
Using a 2 (2 ** 1) bytes message
openssl_md5 took: 0.251019 seconds
md5 took: 0.127750 seconds
Using a 4 (2 ** 2) bytes message
openssl_md5 took: 0.257018 seconds
md5 took: 0.123116 seconds
Using a 8 (2 ** 3) bytes message
openssl_md5 took: 0.245399 seconds
md5 took: 0.128267 seconds
Using a 16 (2 ** 4) bytes message
openssl_md5 took: 0.251832 seconds
md5 took: 0.136373 seconds
Using a 32 (2 ** 5) bytes message
openssl_md5 took: 0.248410 seconds
md5 took: 0.140708 seconds
Using a 64 (2 ** 6) bytes message
openssl_md5 took: 0.361016 seconds
md5 took: 0.267021 seconds
Using a 128 (2 ** 7) bytes message
openssl_md5 took: 0.478735 seconds
md5 took: 0.413986 seconds
Using a 256 (2 ** 8) bytes message
openssl_md5 took: 0.707602 seconds
md5 took: 0.695042 seconds
Using a 512 (2 ** 9) bytes message
openssl_md5 took: 1.216832 seconds
md5 took: 1.268570 seconds
Using a 1024 (2 ** 10) bytes message
openssl_md5 took: 2.122014 seconds
md5 took: 2.429623 seconds
Using a 2048 (2 ** 11) bytes message
openssl_md5 took: 4.158188 seconds
md5 took: 4.847686 seconds
Using a 4096 (2 ** 12) bytes message
openssl_md5 took: 7.839173 seconds
md5 took: 9.242224 seconds
Using a 8192 (2 ** 13) bytes message
openssl_md5 took: 15.282232 seconds
md5 took: 18.368874 seconds
Using a 16384 (2 ** 14) bytes message
openssl_md5 took: 30.681912 seconds
md5 took: 36.755073 seconds
Using a 32768 (2 ** 15) bytes message
openssl_md5 took: 60.230543 seconds
md5 took: 73.237356 seconds
Done.
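The results seem to be quite different from yours. In my case:
- Starting with messages somewhere in the [~512B .. ~1KiB] size range, the OpenSSL implementation seems to perform better than the builtin one
- I know there are too few results to claim a pattern, but both implementations seem to scale linearly (in terms of time) with the message size. The builtin slope seems a bit steeper, meaning it will perform worse in the long run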
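As a back-of-the-envelope check of the linearity claim, a straight line can be drawn through the smallest and largest measurement of each implementation from the Win 10 run above, and the point where the two lines meet computed. This is just arithmetic on four numbers copied from that output, not a proper fit, and line_through is a helper name I made up:

```python
# Timings copied from the Win 10 run above (seconds per 1,000,000 calls).
N_CALLS = 1_000_000
small, large = 1, 32768  # message sizes in bytes
openssl = {small: 0.449134, large: 59.123447}
builtin = {small: 0.120021, large: 86.453821}


def line_through(points):
    """Return (intercept, slope) of the line through two (size, time) points."""
    (x0, y0), (x1, y1) = sorted(points.items())
    slope = (y1 - y0) / (x1 - x0)
    return y0 - slope * x0, slope


a_o, b_o = line_through(openssl)
a_b, b_b = line_through(builtin)

# Message size at which both lines predict the same time:
# a_o + b_o * x == a_b + b_b * x
crossover_bytes = (a_o - a_b) / (b_b - b_o)

# Per-byte cost of a single call, in nanoseconds.
ns_per_byte_openssl = b_o / N_CALLS * 1e9
ns_per_byte_builtin = b_b / N_CALLS * 1e9

print(round(crossover_bytes), ns_per_byte_openssl, ns_per_byte_builtin)
```

The estimated crossover lands between 256 and 512 bytes, which agrees with the table: at 256 bytes the builtin is still ahead, at 512 bytes OpenSSL has overtaken it.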
In conclusion, if all your messages are small and the builtin implementation works best for you, then use it.
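If you go that way, it is probably wise to guard the import, since _md5 is a private, undocumented module that may be missing or change without notice. A minimal sketch (fast_md5 is just a name I picked):

```python
try:
    # Private CPython module: undocumented, may be absent on some builds.
    from _md5 import md5 as fast_md5
except ImportError:
    # Fall back to the public API, whatever implementation backs it.
    from hashlib import md5 as fast_md5

digest = fast_md5(b"A").hexdigest()
print(digest)
```

Either way the digest is the same; only the per-call cost differs.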
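Update #1
Graphical representation (I had to reduce the number of timeit iterations by an order of magnitude, as it would take way too long for large messages):
and zooming in on the area where the 2 graphs intersect: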