Generating numpy array of indices for a deduplicated set of points

Question

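I have an array of at least 100,000 points (and up to 3 billion), some of which are duplicates. I would like to deduplicate the points and generate an array of indices that maps every original point, duplicates included, to its position in the deduplicated array, in the original order.

For example: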

x = [(0, 0),  # (x1, y1)
     (1, 0),  # (x2, y2)
     (1, 1),  # (x3, y3)
     (0, 0)]  # (x4, y4)

Deduplicating x gives y:

y = list(set(x))  # set order is arbitrary; one possible result:
# y = [(1, 0),  # (x2, y2)
#      (0, 0),  # (x1, y1) and (x4, y4)
#      (1, 1)]  # (x3, y3)

The resulting array of indices, z, would then be:

z = [1,  # (x1, y1) 
     0,  # (x2, y2)
     2,  # (x3, y3)
     1]  # (x4, y4)

Is there a numpy-like way of obtaining z? Here is a brute-force implementation:

z = []
for each_point in x:
    index = y.index(each_point)
    z.append(index)

Answer 1

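import numpy as np

x = np.asarray(x)  # the row-view trick needs a 2-D ndarray, not a list of tuples
# View each row as one opaque (void) item so np.unique compares whole rows.
x2 = np.ascontiguousarray(x).view(np.dtype((np.void, x.dtype.itemsize * x.shape[1]))).ravel()
y_temp, z = np.unique(x2, return_inverse=True)
# Convert the unique items back to rows of the original dtype.
y = y_temp.view(x.dtype).reshape(-1, x.shape[1])
print(y)
print(z)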

yields

[[0 0]
 [1 0]
 [1 1]]

and

[0 1 2 0]

Source: Find unique rows in numpy.array
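
As a possible alternative, assuming NumPy 1.13 or later, np.unique accepts an axis argument that deduplicates whole rows directly, so the void-view trick is not needed. The following is a minimal sketch on the example data from the question, not a benchmarked solution for billions of points; the final check confirms that indexing y with z rebuilds x.

```python
import numpy as np

x = np.array([(0, 0), (1, 0), (1, 1), (0, 0)])

# axis=0 treats each row as one item; unique rows come back sorted.
y, z = np.unique(x, axis=0, return_inverse=True)
z = z.ravel()  # keep the inverse 1-D regardless of NumPy version

print(y)
print(z)

# Sanity check: indexing the unique rows with z rebuilds the original array.
assert np.array_equal(y[z], x)
```

Note that y is sorted here, so the indices in z refer to the sorted order rather than the arbitrary order of list(set(x)).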

Answer 2

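This problem can be solved elegantly with the numpy_indexed package (disclaimer: I am its author). Under the hood it is similar to the solution posted by Alex, but with a nicer interface and more tests: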

import numpy_indexed as npi
y, z = npi.unique(x, return_inverse=True)

Tags: python, arrays, numpy, deduplication