Efficient way of iterating through large amounts of data in Python

Question

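I am trying to run a dictionary attack on a sha512 hash. I know the hashed phrase consists of two words, all lowercase, separated by a space. The words come from a known dictionary (02-dictionary.txt) that contains 172,820 words. Currently, my code is as follows: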

import hashlib
import sys
import time

def crack_hash(word, target):
    dict_hash = hashlib.sha512(word.encode())
    if dict_hash.hexdigest() == target:
        return (True, word)
    else:
        return (False, None)

if __name__ == "__main__":
    target_hash = sys.argv[1].strip()
    
    fp = open("02-dictionary.txt", "r")

    words = []
    start_time = time.time()
    for word in fp:
        words.append(word)
    fp.close()

    for word1 in words:
        for word2 in words:
            big_word = word1.strip() + " " + word2.strip()
            print(big_word)
            soln_found, soln_word = crack_hash(big_word.strip(), target_hash)
            if soln_found:
                print('Solution found')
                print("The word was:", soln_word)
                break

    end_time = time.time()
    total_time = end_time - start_time
    print("Time taken:", round(total_time, 5), "seconds")

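However, when I run this code, it is extremely slow. I know Python is not the most efficient language, but I suspect the problem stems more from my choice of data structure. Is there a more efficient data structure? I tried the array module, but the documentation makes it look like it is designed for more primitive types (integers, floats, shorts, bools, characters, etc.) rather than for more complex types such as strings (or lists of characters). What is the best way to improve this code? After about an hour of running, I had only gotten through about 1% of all possible word combinations.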

Answer 1

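The problem is that you are computing 172,820² = 29,866,752,400 (roughly 2³⁵) hashes. That is a lot of work. I have made a few changes to get some optimizations in pure Python, but I suspect the overhead of the hashlib calls is quite significant. I think doing all of this in native code would give a more significant speedup.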

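The optimizations include the following:

Precomputing the dictionary words as bytes objects
Precomputing the partial hash state for the first part of the phrase (the first word plus the space)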
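
To see why the second optimization is valid, here is a minimal sketch. The helper names `digest_full` and `digest_via_prefix` and the sample words are made up for illustration. It checks that copying a partially fed hashlib object and then feeding it the second word gives the same digest as hashing the whole phrase in one call.

```python
import hashlib


def digest_full(word1: bytes, word2: bytes) -> bytes:
    # Hash the complete phrase "word1 word2" in one call.
    return hashlib.sha512(word1 + b" " + word2).digest()


def digest_via_prefix(prefix_state, word2: bytes) -> bytes:
    # Copy the saved state so the original prefix state stays reusable,
    # then feed only the second word into the copy.
    state = prefix_state.copy()
    state.update(word2)
    return state.digest()


# Hypothetical sample words; the real program reads them from the dictionary.
prefix_state = hashlib.sha512(b"hello ")
for second in (b"world", b"there"):
    assert digest_via_prefix(prefix_state, second) == digest_full(b"hello", second)
print("prefix copies match full hashes")
```

Because the state for the first word and the space is built once per outer iteration, the inner loop only pays for a copy and a short update. The full program applies the same idea across the whole word list: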

import hashlib
import sys
import time


def try_all(words, target_hash):
    for word1 in words:
        hash_prefix = hashlib.sha512(word1 + b' ')
        for word2 in words:
            prefix_copy = hash_prefix.copy()
            prefix_copy.update(word2)
            # No per-candidate print here: console output in the hot loop is slow.
            if prefix_copy.digest() == target_hash:
                print('Solution found')
                big_word = (word1 + b' ' + word2).decode('utf8')
                print(f'The word was: {big_word}')
                return


def read_all_words(filename):
    with open(filename, "rt") as f:
        return [line.strip().encode('utf-8') for line in f]


def get_test_hash(words):
    phrase = words[-2] + b' ' + words[-1]  # pick target towards end
    return hashlib.sha512(phrase).digest()


if __name__ == "__main__":
    words = read_all_words("02-dictionary.txt")
    TESTING = True
    if TESTING:
        words = words[:5000]  # reduce the size of the word list for testing only
        target_hash = get_test_hash(words)
    else:
        target_hash = bytes.fromhex(sys.argv[1].strip())
    start_time = time.time()
    try_all(words, target_hash)
    end_time = time.time()
    total_time = end_time - start_time
    print(f"Time taken: {round(total_time, 5)} seconds")
    print(f'{total_time / pow(len(words), 2)} seconds per hash')

On my laptop this takes about 1.1 * 10^-6 seconds per hash, so trying every pair of words in the dictionary would take a little under 10 hours of CPU time.
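
The estimate can be reproduced with a few lines of arithmetic. This is only a sketch: the word count comes from the question, the per-hash time is the figure measured on the laptop above, and `estimate_hours` is a hypothetical helper name. Your own per-hash time will differ.

```python
WORD_COUNT = 172_820        # dictionary size stated in the question
SECONDS_PER_HASH = 1.1e-6   # measured per-hash time from the run above


def estimate_hours(word_count: int, seconds_per_hash: float) -> float:
    # Every ordered pair (word1, word2) is one candidate phrase.
    pairs = word_count ** 2
    return pairs * seconds_per_hash / 3600


print(f"{WORD_COUNT ** 2:,} candidate pairs")
print(f"{estimate_hours(WORD_COUNT, SECONDS_PER_HASH):.1f} hours worst case")
```

This is the worst case, where the matching pair is the last one tried. The search returns as soon as a match is found, so a typical run finishes sooner.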
