Recreating JS bitwise integer handling in Python3

I need to translate a hash function from JavaScript to Python.

The function is as follows:

function getIndex(string) {
        var length = 27;
        string = string.toLowerCase();
        var hash = 0;
        for (var i = 0; i < string.length; i++) {
                hash = string.charCodeAt(i) + (hash << 6) + (hash << 16) - hash;
        }
        var index = Math.abs(hash % length);
        return index;
}

console.log(getIndex(window.prompt("Enter a string to hash")));

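This function is Objectively Correct™. It is perfect as it is. I cannot change it; I can only recreate it. Whatever it outputs, my Python script must output too.

However, I am running into a couple of problems, which I think are related to the way the two languages handle signed integers.

JS bitwise operators treat their operands as a sequence of 32 bits. Python, however, has no concept of a bit limit and just keeps going like an absolute madman. I think this is the one significant difference between the two languages.

I can limit the length of hash in Python by masking it to 32 bits with hash & 0xFFFFFFFF.

I can also negate hash if it's above 0x7FFFFFFF with hash = hash ^ 0xFFFFFFFF (or hash = ~hash; they both seem to do the same thing). I believe this simulates negative numbers.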
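As a quick sanity check of those two operations, here is a minimal sketch that compares the XOR step with a two's-complement reinterpretation of the same masked value. The names `xor_flip` and `twos_complement` are made up for this illustration, and the sketch assumes that JS reads a 32-bit pattern with the top bit set as a negative two's-complement number.

```python
MASK32 = 0xFFFFFFFF

def xor_flip(x):
    """The approach described above: mask, then XOR when the top bit is set."""
    x &= MASK32
    return x ^ MASK32 if x > 0x7FFFFFFF else x

def twos_complement(x):
    """Alternative: mask, then subtract 2**32 when the top bit is set."""
    x &= MASK32
    return x - 0x100000000 if x > 0x7FFFFFFF else x

sample = 0xFFFFFFFE  # bit pattern that JS reads as -2 in a 32-bit context

print(xor_flip(sample))         # 1
print(twos_complement(sample))  # -2
print(~sample)                  # -4294967295
```

On this sample the XOR result stays positive, and `~` on an unbounded Python integer gives yet another value, so the three operations do not appear to be interchangeable.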

I apply both of these restrictions to the hash with a function called t.

Here is my Python code so far:

def nickColor(string):
    length = 27

    def t(x):
        x = x & 0xFFFFFFFF
        if x > 0x7FFFFFFF:
            x = x ^ 0xFFFFFFFF
        return x

    string = string.lower()
    hash = t(0)
    for letter in string:
        hash = t(hash)
        hash = t(t(ord(letter)) + t(hash << 6) + t(hash << 16) - t(hash))
    index = hash % length
    return index

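It seems to work up until the point where the hash needs to become negative, at which point the two scripts diverge. This usually happens about 4 letters into the string.

I assume my problem lies in recreating JS negative numbers in Python. How can I say bye-bye to this problem?

Here is a working translation: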

def nickColor(string):
    length = 27

    def t(x):
        x &= 0xFFFF_FFFF
        if x > 0x7FFF_FFFF:
            x -= 0x1_0000_0000
        return float(x)

    bytes = string.lower().encode('utf-16-le')
    hash = 0.0
    for i in range(0, len(bytes), 2):
        char_code = bytes[i] + 256*bytes[i+1]
        hash = char_code + t(int(hash) << 6) + t(int(hash) << 16) - hash
    # abs() before % gives the same value as JS Math.abs(hash % length),
    # including when a negative hash is an exact multiple of length.
    return int(abs(hash) % length)

The key point is that only the shifts (<<) are computed as 32-bit integer operations; their results are converted back to double before they enter the additions and subtractions. I'm not familiar with the rules for double-precision floating-point representation in the two languages, but it's safe to assume that on all personal computing devices and web servers it is the same for both, namely IEEE 754 double precision. For very long strings (thousands of characters) the hash could lose some bits of precision, which of course affects the final result, but in the same way in JS as in Python (not what the author of the Objectively Correct™ function intended, but that's the way it is…). The last line corrects for the different definitions of the % operator for negative operands in JavaScript and Python.
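The two ingredients described above can be looked at in isolation. The following is only a sketch: `js_shl` and `js_abs_mod` are hypothetical helper names, not part of any library, and the shift helper assumes a finite input (JS maps NaN and Infinity to 0 before shifting, which is not handled here).

```python
import math

def js_shl(value, count):
    """Rough emulation of JS `value << count` for a finite int or float."""
    bits = (int(value) << (count & 31)) & 0xFFFFFFFF  # keep the low 32 bits
    if bits > 0x7FFFFFFF:                             # top bit set: negative
        bits -= 0x100000000
    return float(bits)                                # JS numbers are doubles

def js_abs_mod(value, modulus):
    """Emulate Math.abs(value % modulus); JS `%` keeps the dividend's sign."""
    return abs(math.fmod(value, modulus))

print(js_shl(1, 31))           # -2147483648.0
print(js_shl(0x7FFFFFFF, 1))   # -2.0
print(-5 % 27)                 # 22 (Python: sign follows the divisor)
print(math.fmod(-5, 27))       # -5.0 (sign follows the dividend, as in JS)
print(js_abs_mod(-5, 27))      # 5.0
```

`math.fmod` is used here because its sign convention matches the JS `%` operator, whereas Python's own `%` always returns a result with the sign of the divisor.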

Also (thanks to Mark Ransom for the reminder), to fully emulate JavaScript you need to consider its string encoding as well, which is UTF-16, but with surrogate pairs handled as if they consisted of 2 characters. By encoding the string as utf-16-le you make sure that the first byte in each 16-bit “word” is the least significant one; plus, you don't get the BOM that you would get if you used plain utf-16 (thank you Martijn Pieters).
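To see what the encoding step produces, here is a small sketch that lists the 16-bit code units of a string, which is what `charCodeAt` walks over in JS. The helper name `utf16_code_units` is invented for this example.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of `text`, one integer per 16-bit word."""
    data = text.encode('utf-16-le')  # little-endian, no BOM
    return [int.from_bytes(data[i:i + 2], 'little')
            for i in range(0, len(data), 2)]

print(utf16_code_units('a'))           # [97]
# A character outside the Basic Multilingual Plane becomes a surrogate pair,
# so it counts as two "characters", just like string.length in JS.
print(utf16_code_units('\U0001F600'))  # [55357, 56832]

# Plain 'utf-16' prepends a 2-byte BOM; 'utf-16-le' does not.
print(len('a'.encode('utf-16')))       # 4
print(len('a'.encode('utf-16-le')))    # 2
```

If the BOM were left in, the loop in the translation would treat it as an extra leading character and the hash would change.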