Javascript Murmurhash3 的实现给出与 Murmurhash3.cpp 相同的结果,由 Python 的 sklearn 中可用的转换使用

Javascript implementation of Murmurhash3 to give the same result as Murmurhash3.cpp used by transform available in Python's sklearn

(非常抱歉,我不允许添加很多 URL 来帮助我更好地解释我在这个 post 中的问题,因为我是 Whosebug 的新手,而且我的 Whosebug 帐户权限非常低)。

摘要

任何人都可以指导我如何修改 murmurhash3.js(下图),以便它生成与 MurmurHash3.cpp(下图)相同的哈希值吗?我根据需要为 MurmurHash3.cpp 提供了一个简单的 python 代码 "simple_python_wrapper.py"。如果您安装了 sklearn,simple_python_wrapper.py 应该可以在您的计算机上运行。

我在使用 sklearn(一个 Python 机器学习库)中的 transform 时大量使用 Murmurhash3.cpp(如下所示):from sklearn.feature_extraction._hashing import transform 在我的一个机器学习中项目。 transform 在 sklearn 的 implementation/import 树的深处使用 Murmurhash3.cpp

更多详情

hash % (2^18) {即"hash modulus 2^18"}基于MurmurHash3.cpp

"hello" gives 260679
"there" gives 45525

hash % (2^18) {即 "hash modulus 2^18"} 基于 murmurhash3.js

"hello" gives -58999
"there" gives 65775

murmurhash3.js

/*
 *  The MurmurHash3 algorithm was created by Austin Appleby.  This JavaScript port was authored
 *  by whitequark (based on Java port by Yonik Seeley) and is placed into the public domain.
 *  The author hereby disclaims copyright to this source code.
 *
 *  This produces exactly the same hash values as the final C++ version of MurmurHash3 and
 *  is thus suitable for producing the same hash values across platforms.
 *
 *  There are two versions of this hash implementation. First interprets the string as a
 *  sequence of bytes, ignoring most significant byte of each codepoint. The second one
 *  interprets the string as a UTF-16 codepoint sequence, and appends each 16-bit codepoint
 *  to the hash independently. The latter mode was not written to be compatible with
 *  any other implementation, but it should offer better performance for JavaScript-only
 *  applications.
 *
 *  See http://github.com/whitequark/murmurhash3-js for future updates to this file.
 */

var MurmurHash3 = {
    mul32: function(m, n) {
        var nlo = n & 0xffff;
        var nhi = n - nlo;
        return ((nhi * m | 0) + (nlo * m | 0)) | 0;
    },

    hashBytes: function(data, len, seed) {
        var c1 = 0xcc9e2d51, c2 = 0x1b873593;

        var h1 = seed;
        var roundedEnd = len & ~0x3;

        for (var i = 0; i < roundedEnd; i += 4) {
            var k1 = (data.charCodeAt(i)     & 0xff)        |
                ((data.charCodeAt(i + 1) & 0xff) << 8)  |
                ((data.charCodeAt(i + 2) & 0xff) << 16) |
                ((data.charCodeAt(i + 3) & 0xff) << 24);

            k1 = this.mul32(k1, c1);
            k1 = ((k1 & 0x1ffff) << 15) | (k1 >>> 17);  // ROTL32(k1,15);
            k1 = this.mul32(k1, c2);

            h1 ^= k1;
            h1 = ((h1 & 0x7ffff) << 13) | (h1 >>> 19);  // ROTL32(h1,13);
            h1 = (h1 * 5 + 0xe6546b64) | 0;
        }

        k1 = 0;

        switch(len % 4) {
            case 3:
                k1 = (data.charCodeAt(roundedEnd + 2) & 0xff) << 16;
                // fallthrough
            case 2:
                k1 |= (data.charCodeAt(roundedEnd + 1) & 0xff) << 8;
                // fallthrough
            case 1:
                k1 |= (data.charCodeAt(roundedEnd) & 0xff);
                k1 = this.mul32(k1, c1);
                k1 = ((k1 & 0x1ffff) << 15) | (k1 >>> 17);  // ROTL32(k1,15);
                k1 = this.mul32(k1, c2);
                h1 ^= k1;
        }

        // finalization
        h1 ^= len;

        // fmix(h1);
        h1 ^= h1 >>> 16;
        h1  = this.mul32(h1, 0x85ebca6b);
        h1 ^= h1 >>> 13;
        h1  = this.mul32(h1, 0xc2b2ae35);
        h1 ^= h1 >>> 16;

        return h1;
    },

    hashString: function(data, len, seed) {
        var c1 = 0xcc9e2d51, c2 = 0x1b873593;

        var h1 = seed;
        var roundedEnd = len & ~0x1;

        for (var i = 0; i < roundedEnd; i += 2) {
            var k1 = data.charCodeAt(i) | (data.charCodeAt(i + 1) << 16);

            k1 = this.mul32(k1, c1);
            k1 = ((k1 & 0x1ffff) << 15) | (k1 >>> 17);  // ROTL32(k1,15);
            k1 = this.mul32(k1, c2);

            h1 ^= k1;
            h1 = ((h1 & 0x7ffff) << 13) | (h1 >>> 19);  // ROTL32(h1,13);
            h1 = (h1 * 5 + 0xe6546b64) | 0;
        }

        if((len % 2) == 1) {
            k1 = data.charCodeAt(roundedEnd);
            k1 = this.mul32(k1, c1);
            k1 = ((k1 & 0x1ffff) << 15) | (k1 >>> 17);  // ROTL32(k1,15);
            k1 = this.mul32(k1, c2);
            h1 ^= k1;
        }

        // finalization
        h1 ^= (len << 1);

        // fmix(h1);
        h1 ^= h1 >>> 16;
        h1  = this.mul32(h1, 0x85ebca6b);
        h1 ^= h1 >>> 13;
        h1  = this.mul32(h1, 0xc2b2ae35);
        h1 ^= h1 >>> 16;

        return h1;
    }
};

if(typeof module !== "undefined" && typeof module.exports !== "undefined") {
    module.exports = MurmurHash3;
}

这是 HTML 代码 + Javascript 我用来测试 javascript

https://jsbin.com/gicomikike/edit?html,js,output

<html>
<head>

<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.0/jquery.min.js"></script>
<script src="murmurhash3.js"></script>

<script>

function call_murmurhash3_32_gc () {
    var key = $('textarea#textarea1').val();
    var seed = 0;
    var hash = MurmurHash3.hashString (key, key.length, seed);
    $('div#div1').text(hash);
    }
</script>

</head>
<body>
Body

<form>

    <textarea rows="4" cols="50" id=textarea1></textarea>

    <br>
    <input type="button" value="Hash" onclick="call_murmurhash3_32_gc()"/>

</form>


<div id=div1>

</div>

</body>
</html>

simple_python_wrapper.py

这在 sklearn 的导入树中使用了 MurmurHash3.cpp。

from sklearn.feature_extraction._hashing import transform 
import numpy as np

def getHashIndex (words):

    raw_X = words
    n_features = 262144 # 2 ** 18
    dtype = np.float32 #np.float64

    #transform(raw_X, Py_ssize_t n_features, dtype)
    indices_a, indptr, values = transform (raw_X, n_features, dtype)

    return indices_a 


words = [[("hello", 1), ("there", 1)]] 

print getHashIndex (words)

输出

[260679  45525]

MurmurHash3.cpp

I copied this code is available here 
https://github.com/karanlyons/murmurHash3.js/blob/master/murmurHash3.js
//-----------------------------------------------------------------------------
// MurmurHash3 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.

// Note - The x86 and x64 versions do _not_ produce the same results, as the
// algorithms are optimized for their respective platforms. You can still
// compile and run any of them on any platform, but your performance with the
// non-native version will be less than optimal.

#include "MurmurHash3.h"

//-----------------------------------------------------------------------------
// Platform-specific functions and macros

// Microsoft Visual Studio

#if defined(_MSC_VER)

#define FORCE_INLINE    __forceinline

#include <stdlib.h>

#define ROTL32(x,y) _rotl(x,y)
#define ROTL64(x,y) _rotl64(x,y)

#define BIG_CONSTANT(x) (x)

// Other compilers

#else   // defined(_MSC_VER)

#if defined(GNUC) && ((GNUC > 4) || (GNUC == 4 && GNUC_MINOR >= 4))

/* gcc version >= 4.4 4.1 = RHEL 5, 4.4 = RHEL 6.
 * Don't inline for RHEL 5 gcc which is 4.1 */
#define FORCE_INLINE attribute((always_inline))

#else

#define FORCE_INLINE

#endif


inline uint32_t rotl32 ( uint32_t x, int8_t r )
{
  return (x << r) | (x >> (32 - r));
}

inline uint64_t rotl64 ( uint64_t x, int8_t r )
{
  return (x << r) | (x >> (64 - r));
}

#define ROTL32(x,y) rotl32(x,y)
#define ROTL64(x,y) rotl64(x,y)

#define BIG_CONSTANT(x) (x##LLU)

#endif // !defined(_MSC_VER)

//-----------------------------------------------------------------------------
// Block read - if your platform needs to do endian-swapping or can only
// handle aligned reads, do the conversion here

FORCE_INLINE uint32_t getblock ( const uint32_t * p, int i )
{
  return p[i];
}

FORCE_INLINE uint64_t getblock ( const uint64_t * p, int i )
{
  return p[i];
}

//-----------------------------------------------------------------------------
// Finalization mix - force all bits of a hash block to avalanche

FORCE_INLINE uint32_t fmix ( uint32_t h )
{
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;

  return h;
}

//----------

FORCE_INLINE uint64_t fmix ( uint64_t k )
{
  k ^= k >> 33;
  k *= BIG_CONSTANT(0xff51afd7ed558ccd);
  k ^= k >> 33;
  k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);
  k ^= k >> 33;

  return k;
}

//-----------------------------------------------------------------------------

void MurmurHash3_x86_32 ( const void * key, int len,
                          uint32_t seed, void * out )
{
  const uint8_t * data = (const uint8_t*)key;
  const int nblocks = len / 4;

  uint32_t h1 = seed;

  uint32_t c1 = 0xcc9e2d51;
  uint32_t c2 = 0x1b873593;

  //----------
  // body

  const uint32_t * blocks = (const uint32_t *)(data + nblocks*4);

  for(int i = -nblocks; i; i++)
  {
    uint32_t k1 = getblock(blocks,i);

    k1 *= c1;
    k1 = ROTL32(k1,15);
    k1 *= c2;

    h1 ^= k1;
    h1 = ROTL32(h1,13);
    h1 = h1*5+0xe6546b64;
  }

  //----------
  // tail

  const uint8_t * tail = (const uint8_t*)(data + nblocks*4);

  uint32_t k1 = 0;

  switch(len & 3)
  {
  case 3: k1 ^= tail[2] << 16;
  case 2: k1 ^= tail[1] << 8;
  case 1: k1 ^= tail[0];
          k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
  };

  //----------
  // finalization

  h1 ^= len;

  h1 = fmix(h1);

  *(uint32_t*)out = h1;
}

//-----------------------------------------------------------------------------

void MurmurHash3_x86_128 ( const void * key, const int len,
                           uint32_t seed, void * out )
{
  const uint8_t * data = (const uint8_t*)key;
  const int nblocks = len / 16;

  uint32_t h1 = seed;
  uint32_t h2 = seed;
  uint32_t h3 = seed;
  uint32_t h4 = seed;

  uint32_t c1 = 0x239b961b;
  uint32_t c2 = 0xab0e9789;
  uint32_t c3 = 0x38b34ae5;
  uint32_t c4 = 0xa1e38b93;

  //----------
  // body

  const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);

  for(int i = -nblocks; i; i++)
  {
    uint32_t k1 = getblock(blocks,i*4+0);
    uint32_t k2 = getblock(blocks,i*4+1);
    uint32_t k3 = getblock(blocks,i*4+2);
    uint32_t k4 = getblock(blocks,i*4+3);

    k1 *= c1; k1  = ROTL32(k1,15); k1 *= c2; h1 ^= k1;

    h1 = ROTL32(h1,19); h1 += h2; h1 = h1*5+0x561ccd1b;

    k2 *= c2; k2  = ROTL32(k2,16); k2 *= c3; h2 ^= k2;

    h2 = ROTL32(h2,17); h2 += h3; h2 = h2*5+0x0bcaa747;

    k3 *= c3; k3  = ROTL32(k3,17); k3 *= c4; h3 ^= k3;

    h3 = ROTL32(h3,15); h3 += h4; h3 = h3*5+0x96cd1c35;

    k4 *= c4; k4  = ROTL32(k4,18); k4 *= c1; h4 ^= k4;

    h4 = ROTL32(h4,13); h4 += h1; h4 = h4*5+0x32ac3b17;
  }

  //----------
  // tail

  const uint8_t * tail = (const uint8_t*)(data + nblocks*16);

  uint32_t k1 = 0;
  uint32_t k2 = 0;
  uint32_t k3 = 0;
  uint32_t k4 = 0;

  switch(len & 15)
  {
  case 15: k4 ^= tail[14] << 16;
  case 14: k4 ^= tail[13] << 8;
  case 13: k4 ^= tail[12] << 0;
           k4 *= c4; k4  = ROTL32(k4,18); k4 *= c1; h4 ^= k4;

  case 12: k3 ^= tail[11] << 24;
  case 11: k3 ^= tail[10] << 16;
  case 10: k3 ^= tail[ 9] << 8;
  case  9: k3 ^= tail[ 8] << 0;
           k3 *= c3; k3  = ROTL32(k3,17); k3 *= c4; h3 ^= k3;

  case  8: k2 ^= tail[ 7] << 24;
  case  7: k2 ^= tail[ 6] << 16;
  case  6: k2 ^= tail[ 5] << 8;
  case  5: k2 ^= tail[ 4] << 0;
           k2 *= c2; k2  = ROTL32(k2,16); k2 *= c3; h2 ^= k2;

  case  4: k1 ^= tail[ 3] << 24;
  case  3: k1 ^= tail[ 2] << 16;
  case  2: k1 ^= tail[ 1] << 8;
  case  1: k1 ^= tail[ 0] << 0;
           k1 *= c1; k1  = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
  };

  //----------
  // finalization

  h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len;

  h1 += h2; h1 += h3; h1 += h4;
  h2 += h1; h3 += h1; h4 += h1;

  h1 = fmix(h1);
  h2 = fmix(h2);
  h3 = fmix(h3);
  h4 = fmix(h4);

  h1 += h2; h1 += h3; h1 += h4;
  h2 += h1; h3 += h1; h4 += h1;

  ((uint32_t*)out)[0] = h1;
  ((uint32_t*)out)[1] = h2;
  ((uint32_t*)out)[2] = h3;
  ((uint32_t*)out)[3] = h4;
}

//-----------------------------------------------------------------------------

void MurmurHash3_x64_128 ( const void * key, const int len,
                           const uint32_t seed, void * out )
{
  const uint8_t * data = (const uint8_t*)key;
  const int nblocks = len / 16;

  uint64_t h1 = seed;
  uint64_t h2 = seed;

  uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
  uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);

  //----------
  // body

  const uint64_t * blocks = (const uint64_t *)(data);

  for(int i = 0; i < nblocks; i++)
  {
    uint64_t k1 = getblock(blocks,i*2+0);
    uint64_t k2 = getblock(blocks,i*2+1);

    k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;

    h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;

    k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;

    h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;
  }

  //----------
  // tail

  const uint8_t * tail = (const uint8_t*)(data + nblocks*16);

  uint64_t k1 = 0;
  uint64_t k2 = 0;

  switch(len & 15)
  {
  case 15: k2 ^= uint64_t(tail[14]) << 48;
  case 14: k2 ^= uint64_t(tail[13]) << 40;
  case 13: k2 ^= uint64_t(tail[12]) << 32;
  case 12: k2 ^= uint64_t(tail[11]) << 24;
  case 11: k2 ^= uint64_t(tail[10]) << 16;
  case 10: k2 ^= uint64_t(tail[ 9]) << 8;
  case  9: k2 ^= uint64_t(tail[ 8]) << 0;
           k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;

  case  8: k1 ^= uint64_t(tail[ 7]) << 56;
  case  7: k1 ^= uint64_t(tail[ 6]) << 48;
  case  6: k1 ^= uint64_t(tail[ 5]) << 40;
  case  5: k1 ^= uint64_t(tail[ 4]) << 32;
  case  4: k1 ^= uint64_t(tail[ 3]) << 24;
  case  3: k1 ^= uint64_t(tail[ 2]) << 16;
  case  2: k1 ^= uint64_t(tail[ 1]) << 8;
  case  1: k1 ^= uint64_t(tail[ 0]) << 0;
           k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
  };

  //----------
  // finalization

  h1 ^= len; h2 ^= len;

  h1 += h2;
  h2 += h1;

  h1 = fmix(h1);
  h2 = fmix(h2);

  h1 += h2;
  h2 += h1;

  ((uint64_t*)out)[0] = h1;
  ((uint64_t*)out)[1] = h2;
}

//-----------------------------------------------------------------------------

让我再解释一下。

from sklearn.feature_extraction._hashing import transform 使用此代码 https://github.com/scikit-learn/scikit-learn/blob/412996f09b6756752dfd3736c306d46fca8f1aa1/sklearn/feature_extraction/_hashing.pyx 它利用了这个 from sklearn.utils.murmurhash cimport murmurhash3_bytes_s32 反过来又利用了这个 https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/murmurhash.pyx 这是建立在这个 https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/src/MurmurHash3.cpp所以,MurmurHash3.cpp很重要。我需要这个 MurmurHash3.cpp 的 Javascript 版本,这样 Javascript 代码和 MurmurHash3.cpp 会产生相同的结果 .

我需要这个,因为我想让我的一些机器学习工具在线可用,并且哈希需要在客户端的网络浏览器上完成。

到目前为止,我已经找到了几个 Javascript MurmurHash3 的实现。但是,murmurhash3.js https://github.com/whitequark/murmurhash3-js/blob/master/murmurhash3.js 似乎(就代码结构而言)最接近 sklearn 使用的 MurmurHash3.cpp。但是我仍然没有从他们两个那里得到相同的哈希值。

任何人都可以指导我如何修改 murmurhash3.js(上文),以便它生成与 MurmurHash3.cpp(上文)相同的哈希值吗?

根据@ChristopherOicles 的建议,我更改了我的 Javascript 代码(我的 header 关闭了我的 HTML 代码)以使用 hashBytes 而不是 hashString 如下所示。我还注意到,出于我的目的,我需要将 hashBytes 的返回值更改为其绝对值(我这样做了)。这些解决了我的问题,现在我从 Python/C++ 代码和 Javascript 代码中得到了相同的散列值。

修改了我的 HTML 文件中的 Javascript 函数

<script>
function call_murmurhash3_32_gc () {
    var key = $('textarea#textarea1').val();
    var seed = 0;
    var hash = MurmurHash3.hashBytes (key, key.length, seed);
    $('div#div1').text(Math.abs (hash) % 262144);
    }
</script>

我的完整解决方案在这里https://jsbin.com/qilokot/edit?html,js,output .

再次感谢 Christopher Oicles 以及所有试图以某种方式帮助我的人。

来晚了,现在才遇到这个问题

如果字符串不是由常规 ASCII 字符组成,您使用的实现将产生与参考实现不同的结果。这是因为它 uses charCodeAt and a byte mask 从输入字符串中获取需要散列的字节。如果字符串包含任何其他字符,除非您在其他平台上执行完全相同的操作来解码字符串,否则结果会有所不同。

我制作了 a fork of MurmurHash3js,它使用字节而不是字符串作为输入。以下是您如何使用它:

npm install murmurhash3js-revisited

然后在你的js文件中你可以做:

import MurmurHash3 from 'murmurhash3js-revisited';

const str = "My hovercraft is full of eels.";
// get utf-8 bytes
const bytes = new TextEncoder().encode(str);

MurmurHash3.x86.hash32(bytes);
// output: 2953494853

MurmurHash3.x86.hash128(bytes);
// output: "e3a186aee169ba6c6a8bd9343c68fa9c"

MurmurHash3.x64.hash128(bytes);
// output: "03e5e14d358c16d1e5ae86df7ed5cfcb"

MurmurHash3.x86.hash32("any string");
// output: undefined
// (x86.hash128 and x64.hash128 also return undefined)

您可以阅读有关性能注意事项的信息,并尝试在 my library's documentation 中的不同 JavaScript 实现之间进行交互式比较。