与两个 numpy 数组相交并对其进行排序的索引
Indices that intersect and sort two numpy arrays
我有两个numpy整数数组,长度都是几亿。在每个数组中,值都是唯一的,并且每个值最初都是未排序的。
我想要每个产生排序交集的索引。例如:
x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
然后这些的排序交集是[4, 5, 11]
,所以我们想要将每个 x 和 y 变成那个数组的索引,所以我们希望它是 return:
mx = np.array([0, 3, 6])
my = np.array([2, 1, 4])
从那以后x[mx] == y[my] == np.intersect1d(x, y)
到目前为止我们唯一的解决方案涉及三个不同的 argsorts,因此看起来不太可能是最优的。
每个值代表一个星系,以防让问题变得更有趣。
也许使用字典的纯 Python 解决方案适合您:
def indices_from_values(a, intersect):
idx = {value: index for index, value in enumerate(a)}
return np.array([idx[x] for x in intersect])
intersect = np.intersect1d(x, y)
mx = indices_from_values(x, intersect)
my = indices_from_values(y, intersect)
np.allclose(x[mx], y[my]) and np.allclose(x[mx], np.intersect1d(x, y))
对于纯 numpy 解决方案,您可以这样做:
用np.unique
分别得到x
和y
中的唯一值和对应的索引:
# sorted unique values in x and y and the indices corresponding to their first
# occurrences, such that u_x == x[u_idx_x]
u_x, u_idx_x = np.unique(x, return_index=True)
u_y, u_idx_y = np.unique(y, return_index=True)
使用np.intersect1d
求唯一值的交集:
# we can assume_unique, which can be faster for large arrays
i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
最后,使用 np.in1d
到 select 仅对应于 x
或 y
中的唯一值的索引也恰好在x
和 y
的交集:
# it is also safe to assume_unique here
i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
将所有这些整合到一个函数中:
def intersect_indices(x, y):
u_x, u_idx_x = np.unique(x, return_index=True)
u_y, u_idx_y = np.unique(y, return_index=True)
i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
return i_idx_x, i_idx_y
例如:
x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
i_idx_x, i_idx_y = intersect_indices(x, y)
print(i_idx_x, i_idx_y)
# (array([0, 3, 6]), array([2, 1, 4]))
速度测试:
In [1]: k = 1000000
In [2]: %%timeit x, y = np.random.randint(k, size=(2, k))
intersect_indices(x, y)
....:
1 loops, best of 3: 597 ms per loop
更新:
我最初忽略了一个事实,即在您的情况下 x
和 y
都只包含唯一值。考虑到这一点,使用间接排序可能会做得更好:
def intersect_indices_unique(x, y):
u_idx_x = np.argsort(x)
u_idx_y = np.argsort(y)
i_xy = np.intersect1d(x, y, assume_unique=True)
i_idx_x = u_idx_x[x[u_idx_x].searchsorted(i_xy)]
i_idx_y = u_idx_y[y[u_idx_y].searchsorted(i_xy)]
return i_idx_x, i_idx_y
这是一个更真实的测试用例,其中 x
和 y
都包含唯一(但部分重叠)的值:
In [1]: n, k = 10000000, 1000000
In [2]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices(x, y)
....:
1 loops, best of 3: 593 ms per loop
In [3]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices_unique(x, y)
....:
1 loops, best of 3: 453 ms per loop
@Divakar的解决方案在性能方面非常相似:
In [4]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
searchsorted_based(x, y)
....:
1 loops, best of 3: 472 ms per loop
这是一个基于 intersect1d
实现的选项,非常简单。它需要一次调用 argsort
.
公认的简单测试通过了。
import numpy as np
def my_intersect(x, y):
"""my_intersect(x, y) -> xm, ym
x, y: 1-d arrays of unique values
xm, ym: indices into x and y giving sorted intersection
"""
# basic idea taken from numpy.lib.arraysetops.intersect1d
aux = np.concatenate((x, y))
sidx = aux.argsort()
# Note: intersect1d uses aux[:-1][aux[1:]==aux[:-1]] here - I don't know why the first [:-1] is necessary
inidx = aux[sidx[1:]] == aux[sidx[:-1]]
# quicksort is not stable, so must do some work to extract indices
# (if stable, sidx[inidx.nonzero()] would be for x)
# interlace the two sets of indices, and check against lengths
xym = np.vstack((sidx[inidx.nonzero()],
sidx[1:][inidx.nonzero()])).T.flatten()
xm = xym[xym < len(x)]
ym = xym[xym >= len(x)] - len(x)
return xm, ym
def check_my_intersect(x, y):
mx, my = my_intersect(x, y)
assert (x[mx] == np.intersect1d(x, y)).all()
# not really necessary: np.intersect1d returns a sorted list
assert (x[mx] == sorted(x[mx])).all()
assert (x[mx] == y[my]).all()
def random_unique_unsorted(n):
while True:
x = np.unique(np.random.randint(2*n, size=n))
if len(x):
break
np.random.shuffle(x)
return x
x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
check_my_intersect(x, y)
for i in range(20):
x = random_unique_unsorted(100+i)
y = random_unique_unsorted(200+i)
check_my_intersect(x, y)
编辑:"Note" 评论令人困惑(使用 ...
作为语音省略号,忘了它也是一个 Python 运算符)。
您也可以使用 np.searchsorted
,像这样 -
def searchsorted_based(x,y):
# Get argsort for both x and y
xsort_idx = x.argsort()
ysort_idx = y.argsort()
# Sort x and y and store them
X = x[xsort_idx]
Y = y[ysort_idx]
# Find positions of Y in X and the matches by the positions that
# shift between 'left' and 'right' based searches.
# Use the matches posotions to get corresponding argsort for X.
x1 = np.searchsorted(X,Y,'left')
x2 = np.searchsorted(X,Y,'right')
out1 = xsort_idx[x1[x2 != x1]]
# Repeat for X in Y findings
y1 = np.searchsorted(Y,X,'left')
y2 = np.searchsorted(Y,X,'right')
out2 = ysort_idx[y1[y2 != y1]]
return out1, out2
样本运行-
In [100]: x = np.array([4, 1, 10, 5, 8, 13, 11])
...: y = np.array([20, 5, 4, 9, 11, 7, 25])
...:
In [101]: searchsorted_based(x,y)
Out[101]: (array([0, 3, 6]), array([2, 1, 4]))
我有两个numpy整数数组,长度都是几亿。在每个数组中,值都是唯一的,并且每个值最初都是未排序的。
我想要每个产生排序交集的索引。例如:
x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
然后这些的排序交集是[4, 5, 11]
,所以我们想要将每个 x 和 y 变成那个数组的索引,所以我们希望它是 return:
mx = np.array([0, 3, 6])
my = np.array([2, 1, 4])
从那以后x[mx] == y[my] == np.intersect1d(x, y)
到目前为止我们唯一的解决方案涉及三个不同的 argsorts,因此看起来不太可能是最优的。
每个值代表一个星系,以防让问题变得更有趣。
也许使用字典的纯 Python 解决方案适合您:
def indices_from_values(a, intersect):
idx = {value: index for index, value in enumerate(a)}
return np.array([idx[x] for x in intersect])
intersect = np.intersect1d(x, y)
mx = indices_from_values(x, intersect)
my = indices_from_values(y, intersect)
np.allclose(x[mx], y[my]) and np.allclose(x[mx], np.intersect1d(x, y))
对于纯 numpy 解决方案,您可以这样做:
用
np.unique
分别得到x
和y
中的唯一值和对应的索引:# sorted unique values in x and y and the indices corresponding to their first # occurrences, such that u_x == x[u_idx_x] u_x, u_idx_x = np.unique(x, return_index=True) u_y, u_idx_y = np.unique(y, return_index=True)
使用
np.intersect1d
求唯一值的交集:# we can assume_unique, which can be faster for large arrays i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
最后,使用
np.in1d
到 select 仅对应于x
或y
中的唯一值的索引也恰好在x
和y
的交集:# it is also safe to assume_unique here i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)] i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
将所有这些整合到一个函数中:
def intersect_indices(x, y):
u_x, u_idx_x = np.unique(x, return_index=True)
u_y, u_idx_y = np.unique(y, return_index=True)
i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
return i_idx_x, i_idx_y
例如:
x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
i_idx_x, i_idx_y = intersect_indices(x, y)
print(i_idx_x, i_idx_y)
# (array([0, 3, 6]), array([2, 1, 4]))
速度测试:
In [1]: k = 1000000
In [2]: %%timeit x, y = np.random.randint(k, size=(2, k))
intersect_indices(x, y)
....:
1 loops, best of 3: 597 ms per loop
更新:
我最初忽略了一个事实,即在您的情况下 x
和 y
都只包含唯一值。考虑到这一点,使用间接排序可能会做得更好:
def intersect_indices_unique(x, y):
u_idx_x = np.argsort(x)
u_idx_y = np.argsort(y)
i_xy = np.intersect1d(x, y, assume_unique=True)
i_idx_x = u_idx_x[x[u_idx_x].searchsorted(i_xy)]
i_idx_y = u_idx_y[y[u_idx_y].searchsorted(i_xy)]
return i_idx_x, i_idx_y
这是一个更真实的测试用例,其中 x
和 y
都包含唯一(但部分重叠)的值:
In [1]: n, k = 10000000, 1000000
In [2]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices(x, y)
....:
1 loops, best of 3: 593 ms per loop
In [3]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices_unique(x, y)
....:
1 loops, best of 3: 453 ms per loop
@Divakar的解决方案在性能方面非常相似:
In [4]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
searchsorted_based(x, y)
....:
1 loops, best of 3: 472 ms per loop
这是一个基于 intersect1d
实现的选项,非常简单。它需要一次调用 argsort
.
公认的简单测试通过了。
import numpy as np
def my_intersect(x, y):
"""my_intersect(x, y) -> xm, ym
x, y: 1-d arrays of unique values
xm, ym: indices into x and y giving sorted intersection
"""
# basic idea taken from numpy.lib.arraysetops.intersect1d
aux = np.concatenate((x, y))
sidx = aux.argsort()
# Note: intersect1d uses aux[:-1][aux[1:]==aux[:-1]] here - I don't know why the first [:-1] is necessary
inidx = aux[sidx[1:]] == aux[sidx[:-1]]
# quicksort is not stable, so must do some work to extract indices
# (if stable, sidx[inidx.nonzero()] would be for x)
# interlace the two sets of indices, and check against lengths
xym = np.vstack((sidx[inidx.nonzero()],
sidx[1:][inidx.nonzero()])).T.flatten()
xm = xym[xym < len(x)]
ym = xym[xym >= len(x)] - len(x)
return xm, ym
def check_my_intersect(x, y):
mx, my = my_intersect(x, y)
assert (x[mx] == np.intersect1d(x, y)).all()
# not really necessary: np.intersect1d returns a sorted list
assert (x[mx] == sorted(x[mx])).all()
assert (x[mx] == y[my]).all()
def random_unique_unsorted(n):
while True:
x = np.unique(np.random.randint(2*n, size=n))
if len(x):
break
np.random.shuffle(x)
return x
x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
check_my_intersect(x, y)
for i in range(20):
x = random_unique_unsorted(100+i)
y = random_unique_unsorted(200+i)
check_my_intersect(x, y)
编辑:"Note" 评论令人困惑(使用 ...
作为语音省略号,忘了它也是一个 Python 运算符)。
您也可以使用 np.searchsorted
,像这样 -
def searchsorted_based(x,y):
# Get argsort for both x and y
xsort_idx = x.argsort()
ysort_idx = y.argsort()
# Sort x and y and store them
X = x[xsort_idx]
Y = y[ysort_idx]
# Find positions of Y in X and the matches by the positions that
# shift between 'left' and 'right' based searches.
# Use the matches posotions to get corresponding argsort for X.
x1 = np.searchsorted(X,Y,'left')
x2 = np.searchsorted(X,Y,'right')
out1 = xsort_idx[x1[x2 != x1]]
# Repeat for X in Y findings
y1 = np.searchsorted(Y,X,'left')
y2 = np.searchsorted(Y,X,'right')
out2 = ysort_idx[y1[y2 != y1]]
return out1, out2
样本运行-
In [100]: x = np.array([4, 1, 10, 5, 8, 13, 11])
...: y = np.array([20, 5, 4, 9, 11, 7, 25])
...:
In [101]: searchsorted_based(x,y)
Out[101]: (array([0, 3, 6]), array([2, 1, 4]))