How can I make 2 strings with the same word(s)/meaning, but with Unicode differences, hash to the same id?

Question

For a web-scraping project I'm working on, I plan to store entities in a database where their IDs are the md5 hash of the name/title.

However, because some of the strings contain Unicode characters, the same name/title can end up with different hashes.

For example, the md5 hash of 'Kinesiology, Phys Ed\xa0and Recreation' is different from that of 'Kinesiology, Phys Ed and Recreation'.

I tried using Unicode normalization, but the two hashes still differ:

import hashlib
import unicodedata


def generate_id(*args):
    """

    :param args: strings to be used to generate an id
    :return: md5 hash of the passed arguments
    """
    string = ''
    for arg in args:
        string += ' ' + arg
    hash_algorithm = hashlib.md5()
    hash_algorithm.update(string.encode('utf-8'))
    return hash_algorithm.hexdigest()


def clean_text(text):
    """
    normalizes the unicode in a text to be more readable and generate a more accurate id from
    :param text: string to be normalized
    :return: normalized version of text
    """
    return unicodedata.normalize('NFC', text)


print(generate_id(clean_text('Kinesiology, Phys Ed\xa0and Recreation'))) # hashes to acd21f3b094a77d1a2393a8daeac42d9
print(generate_id('Kinesiology, Phys Ed and Recreation')) # hashes to 5ac6bc3ca3d743d99e9b93a7a5379fe9

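What can I do to make sure that the two strings are treated as the same and hash to the same id, so that 'Kinesiology, Phys Ed\xa0and Recreation' is the same string, with the same hash, as 'Kinesiology, Phys Ed and Recreation' (and likewise for any 2 strings, regardless of whether Unicode is present)?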

Answer 1

Since "having the same hash" is just a proxy for binary equality, what you need is to normalize the strings so that they become identical.

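In Unicode terms, the two given strings are not canonically equivalent, but they are compatible. So you will be able to produce the same hash by using one of the compatibility decomposition/composition normal forms (NFKD or NFKC) in your clean_text() function: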

def clean_text(text):
    return unicodedata.normalize('NFKD', text)

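The NO-BREAK SPACE (U+00A0) character has its decomposition property set to <noBreak> SPACE (U+0020). The presence of a keyword in the decomposition property (in this case <noBreak>) indicates that the character is compatible with the regular space character, but not canonically equivalent to it.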
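As a rough check of the point above, the following sketch prints the decomposition property of U+00A0 and compares the hashes of the two titles under a canonical form (NFC) and a compatibility form (NFKC). The helper name md5_id is made up for this illustration and is not part of the question's code; it hashes the normalized text directly, without the leading space that generate_id() adds.

```python
import hashlib
import unicodedata

nbsp_title = 'Kinesiology, Phys Ed\xa0and Recreation'
plain_title = 'Kinesiology, Phys Ed and Recreation'

# Decomposition property of NO-BREAK SPACE: the <noBreak> tag plus U+0020.
nbsp_decomposition = unicodedata.decomposition('\xa0')
print(nbsp_decomposition)


def md5_id(text, form):
    """Hypothetical helper: normalize text to the given form, then md5 it."""
    normalized = unicodedata.normalize(form, text)
    return hashlib.md5(normalized.encode('utf-8')).hexdigest()


# Canonical normalization leaves U+00A0 alone, so the ids still differ.
same_under_nfc = md5_id(nbsp_title, 'NFC') == md5_id(plain_title, 'NFC')
# Compatibility normalization folds U+00A0 into U+0020, so the ids match.
same_under_nfkc = md5_id(nbsp_title, 'NFKC') == md5_id(plain_title, 'NFKC')
print(same_under_nfc, same_under_nfkc)
```

Whichever compatibility form is chosen, it has to be applied to every string before hashing, including the ones that look like plain ASCII, so that all ids are derived the same way.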

Side note

Since it was asked for in the comments, some clarification on the difference between the NFKC and NFKD normal forms:

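A Unicode character can be made up of more than one code point. Some characters can be represented in different (but canonically equivalent) ways: as a single code point or as a sequence of code points. For example, é can be represented as é or as e + ◌́. When normalizing, the composed normal forms (NFC, NFKC) will try to convert sequences into their composed form (e + ◌́ → é); the decomposed normal forms (NFD, NFKD) will try to convert composed characters into sequences (é → e + ◌́). Which one you use depends entirely on the situation. Just make sure you are not comparing apples with oranges: both strings have to be normalized to the same form before they are compared or hashed.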
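The composed and decomposed representations of é can be inspected with a short sketch like the one below. The variable names are only illustrative; the calls are the standard unicodedata.normalize() function.

```python
import unicodedata

composed = '\u00e9'      # é as a single code point
decomposed = 'e\u0301'   # e followed by COMBINING ACUTE ACCENT

# The raw strings look identical when printed but are not equal.
print(composed == decomposed)

# NFC composes the sequence; NFD decomposes the single code point.
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', composed)
print(len(nfc), len(nfd))

# Same form on both sides: equal. Mixed forms: the apples/oranges case.
same_form = unicodedata.normalize('NFC', composed) == nfc
mixed_forms = nfc == nfd
print(same_form, mixed_forms)
```

For hashing purposes either family works, as long as one form is picked and used consistently for every stored id.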
