Numpy collapse columns according to list
In NumPy, I have a d x n array A and a list L of length n describing where I want each column of A to end up in a matrix B. The idea is that column i of B is the sum of all the columns of A whose corresponding value in L is i.
I can do this with a for loop:
A = np.arange(15).reshape(3,5)
L = [0,1,2,1,1]
n_cols = 3
B = np.zeros((len(A), n_cols))
# assume I know the desired number of columns,
# which is also one more than the maximum value of L
for i, a in enumerate(A.T):
B[:, L[i]] += a
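With the example above, column 0 of A lands in column 0 of B, columns 1, 3 and 4 are summed into column 1, and column 2 lands in column 2:
print(B)
# [[ 0.  8.  2.]
#  [ 5. 23.  7.]
#  [10. 38. 12.]]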
I am wondering if there is a way to do this by slicing the array A (or in some other vectorized way)?
Would this iteration on B do the same thing (untested)? Note that L has to be an array for the L==I mask to work; with a plain list, L==I is just False.
L = np.asarray(L)
for I in range(B.shape[1]):
    B[:, I] = A[:, L==I].sum(axis=1)
It would be fewer loop iterations. But more importantly, it might suggest other vectorized approaches.
(edit) On testing, this works, but is much slower.
==========
scipy.sparse does column sums with a matrix-matrix product. Can we do the same thing here? Make a C array with 1s in the right columns:
def my2(A, L):
    # Build a 0/1 'routing' matrix C with C[j, L[j]] = 1, so that
    # A.dot(C) sums column j of A into column L[j] of the result.
    # As written this assumes the output has as many columns as A does.
    n_cols = len(L)
    C = np.zeros((n_cols, n_cols), int)
    C[np.arange(n_cols), L] = 1
    return A.dot(C)
This is 7x faster than your loop, and a bit faster than @Divakar's reduceat code.
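The same trick generalizes when the number of output columns differs from len(L). Here is a minimal sketch with a rectangular indicator matrix (my2g and the explicit n_cols argument are my additions, not part of the original answer); for large n the same indicator could be stored as a scipy.sparse matrix:
def my2g(A, L, n_cols):
    # (n, n_cols) indicator: row j has its single 1 in column L[j]
    n = len(L)
    C = np.zeros((n, n_cols), int)
    C[np.arange(n), L] = 1
    return A.dot(C)  # (d, n) @ (n, n_cols) -> (d, n_cols)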
==========
In [126]: timeit vectorized_app(A,L)
The slowest run took 8.16 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 99.7 µs per loop
In [127]: timeit val2 = my2(A,L)
The slowest run took 10.46 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 81.6 µs per loop
In [128]: timeit org1(A,n_cols)
1000 loops, best of 3: 623 µs per loop
In [129]: d,n_cols
Out[129]: (10, 100)
You are using L to select the columns of A that get sum-reduced/collapsed together, and you are updating the columns of the output array based on the unique elements of L.
Thus, you can use np.add.reduceat for a vectorized solution, like so -
sidx = L.argsort()  # reorder columns so that equal L values sit next to each other
col_idx, grp_start_idx = np.unique(L[sidx],return_index=True)
B_out = np.zeros((len(A), n_cols))
B_out[:,col_idx] = np.add.reduceat(A[:,sidx],grp_start_idx,axis=1)
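To see the mechanics on the small example from the question (with L = np.array([0, 1, 2, 1, 1])), the intermediate values are:
sidx            # array([0, 1, 3, 4, 2]): equal labels now adjacent
L[sidx]         # array([0, 1, 1, 1, 2])
col_idx         # array([0, 1, 2]): the unique labels
grp_start_idx   # array([0, 1, 4]): where each label's run begins
# reduceat then sums A[:, sidx] over the slices [0:1], [1:4] and [4:5]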
Runtime test -
In [129]: def org_app(A,n_cols):
...: B = np.zeros((len(A), n_cols))
...: for i, a in enumerate(A.T):
...: B[:, L[i]] += a
...: return B
...:
...: def vectorized_app(A,n_cols):
...: sidx = L.argsort()
...: col_idx, grp_start_idx = np.unique(L[sidx],return_index=True)
...: B_out = np.zeros((len(A), n_cols))
...: B_out[:,col_idx] = np.add.reduceat(A[:,sidx],grp_start_idx,axis=1)
...: return B_out
...:
In [130]: # Setup inputs with an appreciable no. of cols & fewer rows,
     ...: # so that the memory bandwidth needed to work with a huge number
     ...: # of row elements doesn't become the bottleneck
...: d,n_cols = 10,5000
...: A = np.random.rand(d,n_cols)
...: L = np.random.randint(0,n_cols,(n_cols,))
...:
In [131]: np.allclose(org_app(A,n_cols),vectorized_app(A,n_cols))
Out[131]: True
In [132]: %timeit org_app(A,n_cols)
10 loops, best of 3: 33.3 ms per loop
In [133]: %timeit vectorized_app(A,n_cols)
100 loops, best of 3: 1.87 ms per loop
As the number of rows becomes comparable to the number of columns in A, the high memory bandwidth requirement of the vectorized approach would offset any noticeable speedup from it.