FAST：1D 与 2D 中的行重叠？

Question

假设我有二维数组，f.e.:

In [136]: ary          
array([[6, 7, 9],
       [0, 2, 5],
       [3, 3, 4],
       [2, 2, 8],
       [3, 4, 9],
       [0, 5, 7],
       [2, 4, 9],
       [3, 5, 7],
       [7, 8, 8],
       [0, 2, 3]])

我想计算与 1D 向量的重叠，FAST。

我几乎可以做到（大阵列上 8 毫秒）：

 (ary == match)     # .sum(axis=1).argsort()[::-1]

它的问题是它只有在 Position 和 Value 都匹配时才匹配。

match == [6,5,4]                                                                                                                                                      
 
array([[ True, False, False],
       [False, False, False],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False,  True, False],
       [False, False, False],
       [False,  True, False],
       [False, False, False],
       [False, False, False]])

F.e。 1d vec 的第 2 列中的 5 与第 2 行第 3 列中的 5 不匹配。

它适用于 .isin()

  np.isin(ary,match,assume_unique=True).sum(axis=1).argsort()[::-1][:5]

但在大数组上速度很慢 (200000,10) ~20ms

帮我扩展第一种情况，让它可以匹配一维向量任意位置的值与行。

预期结果是按 OVERLAP COUNT 排序的行索引，让我们使用 [2,4,5] 因为它有更多匹配项：

In [147]: np.isin(ary,[2,5,4],assume_unique=True)                                                                                                                             
Out[147]: 
array([[False, False, False],
       [False,  True,  True],
       [False, False,  True],
       [ True,  True, False],
       [False,  True, False],
       [False,  True, False],
       [ True,  True, False],
       [False,  True, False],
       [False, False, False],
       [False,  True, False]])

重叠：

In [149]: np.isin(ary,[2,5,4],assume_unique=True).sum(axis=1)                                                                                                                 
Out[149]: array([0, 2, 1, 2, 1, 1, 2, 1, 0, 1])

重叠顺序：

In [148]: np.isin(ary,[2,5,4],assume_unique=True).sum(axis=1).argsort()[::-1]                                                                                                 
Out[148]: array([6, 3, 1, 9, 7, 5, 4, 2, 8, 0])

查看行：6、3、1 有 Overlap:2 这就是为什么它们排在第一位的原因

变体：

#could be from 1000,10,10 to 2000,100,20 .. ++ 
def data(cells=2000,seg=100,items=10): 
    ary = np.random.randint(0,cells,(cells*seg,items))
    rnd = np.random.randint(0,cells*seg)
    return ary, ary[rnd]

def best2(match,ary): #~20ms, (200000,10)
    return np.isin(ary,match,assume_unique=True).sum(axis=1).argsort()[::-1][:5]                                                                                                     

def best3(match,ary): #Corralien  ~20ms
    return np.logical_or.reduce(np.ravel(ary) == match[:, None], axis=0).reshape(ary.shape).sum(axis=1).argsort()[::-1][:5]

如果在 GPU 上使用 numba+cuda 或 cupy 可以加速吗？

Answer 1

使用广播和np.logical_or.reduce:

# match = np.array(match)
>>> np.logical_or.reduce(np.ravel(ary) == match[:, None], axis=0) \
                 .reshape(ary.shape)

array([[ True, False, False],
       [False, False,  True],
       [False, False,  True],
       [False, False, False],
       [False,  True, False],
       [False,  True, False],
       [False,  True, False],
       [False,  True, False],
       [False, False, False],
       [False, False, False]])

性能

match = np.array([6, 5, 4])
ary = np.random.randint(0, 10, (200000, 10))

%timeit np.logical_or.reduce(np.ravel(ary) == match[:, None], axis=0).reshape(ary.shape)
7.49 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Answer 2

所有如此快速的方法的主要问题是它们创建了巨大的临时数组，而最终只有 5 个项目是重要的。 Numba 可用于动态计算数组（使用高效的 JIT 编译循环），避免一些临时数组。此外，不需要完全排序，因为只需要检索前 5 项。可以改用 partition。甚至可以使用更快的方法，因为只有 5 个选定的项目很重要，而不是其他项目。这是结果代码：

@nb.njit('int32[::1](int32[::1], int32[:,::1])')
def computeScore(match, ary):
    n, m = ary.shape
    assert m == match.shape[0]
    tmp = np.empty(n, dtype=np.int32)
    for i in range(n):
        s = 0
        # Count the number of matching items (with repetition)
        for j in range(m):
            # Find a match
            item = ary[i, j]
            found = False
            for k in range(m):
                found |= item == match[k]
            s += found
        tmp[i] = s
    return tmp

def best4(match, ary):
    n, m = ary.shape
    score = computeScore(match, ary)
    bestItems = np.argpartition(score, n-5)[n-5:] # sadly not supported by Numba yet
    order = np.argsort(-score[bestItems]) # bastItems is not sorted and likely needs to be
    return bestItems[order]

请注意，当多个项目之间的匹配分数（存储在 tmp 中）相等时，best4 可以提供不同于 best2 的结果。这是由于 排序算法在 Numpy 中默认不稳定（kind 参数可用于调整此行为）。分区算法也是如此，尽管Numpy似乎还没有提供稳定的分区算法。

此代码应该比其他实现更快，但幅度不会很大。问题之一是 Numba（以及大多数 C/C++ 编译器，例如用于编译 Numpy 的编译器）无法成功生成快速代码，因为它 不知道值 m 在编译时。因此，最激进的优化（例如，展开循环和使用 SIMD 指令）几乎无法应用。您可以使用 断言或转义条件 .

来帮助 Numba

此外，代码可以使用多线程并行化，使其在主流平台上运行速度更快。请注意，并行化版本可能不会在小数据或所有平台上更快，因为创建线程会引入可能比实际计算更大的开销。

这是最终的实现：

@nb.njit('int32[::1](int32[::1], int32[:,::1])', parallel=True)
def computeScoreOpt(match, ary):
    n, m = ary.shape
    assert m == match.shape[0]
    assert m == 10
    tmp = np.empty(n, dtype=np.int32)
    for i in nb.prange(n):
        # Thie enable Numba to assume m=10 in the following code
        # and generate a very efficient code for this specific case.
        # The assert should be enough but the internals of Numba 
        # prevent the information to be propagatted to this portion
        # of the code when it is parallelized.
        if m != 10: continue

        s = 0
        for j in range(m):
            item = ary[i, j]
            found = False
            for k in range(m):
                found |= item == match[k]
            s += found
        tmp[i] = s
    return tmp

def best5(match, ary):
    n, m = ary.shape
    score = computeScoreOpt(match, ary)
    bestItems = np.argpartition(score, n-5)[n-5:]
    order = np.argsort(-score[bestItems])
    return bestItems[order]

这是我的机器上的示例数据集的时间：

best2:                            18.2 ms
best3:                            17.8 ms
best4 (sequential -- default):    12.0 ms
best4 (parallel):                  3.1 ms
best5 (sequential):                3.2 ms
best5 (parallel -- default):       1.2 ms

最快的实现比原始参考实现快 15 倍。

注意，如果m大于30左右，最好使用更高级的基于集合的算法。在这种情况下，另一种解决方案是先对 match 进行排序，然后在基于 i 的循环中使用 np.isin。

FAST：1D 与 2D 中的行重叠？

FAST: 1D overlaps with rows in 2D?

python

arrays

performance

numpy

overlap