How to update multiple copies of a multi-dimensional numpy array in an efficient way

Question

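Main goal: Suppose I have a multi-dimensional array. I also have a set of 0-1 indices, one for each column of each row. For example, if my array is [[3,6,7,8], [1,32,45,7]], my index set could be [[1,0,1,1], [0,0,1,1]]. I want to copy each row of the array n times. Then I want to randomly increase every element whose corresponding index equals 1.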

import time
import random
import numpy as np

def foo(arr, upper_bound, index_set, first_set_size, sec_set_size, limit):
    # each row of arr is copied sec_set_size times
    my_array = np.zeros((first_set_size * sec_set_size, limit))
    it = 0  # index of the next output row to fill
    for i in range(first_set_size):
        for j in range(sec_set_size):
            my_array[it] = arr[i] #copy the elements from the corresponding row
            for k in range(limit):
                if index_set[i][k]==1: #update the elements whose indices are one
                    temp = arr[i][k]   #get the current value
                    my_array[it][k]  =temp + random.randint(1,upper_bound-temp) #I use fastrand.pcg32bounded here. Update the value. 
            it +=1
    return my_array


upper_bound = 50
limit = 1000
first_set_size= 100
sec_set_size = 50
arr = np.random.randint(25, size=(first_set_size, limit)) #create an array containing integer numbers
index_set= np.array([[random.randint(0,1) for j in range(limit)] for i in range(first_set_size)]) #each elements has an index which is either 1 or 0

start_time = time.time() #measure the time taken by the function
result = foo(arr, upper_bound,index_set, first_set_size, sec_set_size, limit)
print("time taken: %s " % (time.time() - start_time))

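When I increase limit and the set sizes, the code takes several minutes. Is there any way to do this faster or more efficiently? I have spent a lot of time on this but could not speed up my implementation.

Edit: Suppose my initial array is: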

[[11 23 24 17  0]
 [ 1 23 12 19  5]
 [20 15  1 17 17]
 [ 3  8  7  0 24]]

And my index set is:

[[1 0 0 0 1]
 [1 0 1 0 0]
 [1 1 1 1 0]
 [0 1 0 1 1]]

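If sec_set_size=5, I want to make 5 copies of each row and, in every copy, randomly increase each element whose index is 1.

The final result should look like this: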

[[39. 23. 24. 17. 44.]
 [50. 23. 24. 17. 27.]
 [42. 23. 24. 17. 24.]
 [45. 23. 24. 17. 11.]
 [49. 23. 24. 17. 43.]
 [23. 23. 44. 19.  5.]
 [10. 23. 37. 19.  5.]
 [14. 23. 29. 19.  5.]
 [12. 23. 22. 19.  5.]
 [ 5. 23. 15. 19.  5.]
 [36. 45. 26. 37. 17.]
 [24. 40. 35. 38. 17.]
 [34. 20. 24. 31. 17.]
 [27. 16.  9. 20. 17.]
 [37. 37.  6. 37. 17.]
 [ 3. 50.  7. 46. 47.]
 [ 3. 13.  7. 37. 44.]
 [ 3. 23.  7. 32. 29.]
 [ 3. 10.  7. 22. 41.]
 [ 3. 22.  7. 32. 41.]]

Answer 1

NumPy is all about vectorization. If you are using Python loops, you are probably doing it wrong.

First, all the random number generators are vectorized:

index_set = np.random.randint(2, size=(first_set_size, limit), dtype=bool)

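You already got this right in the line just above that one (where arr is created).

Next, to copy the rows multiple times, you can use np.repeat: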

my_array = np.repeat(arr, sec_set_size, axis=0)
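
To see what np.repeat does with axis=0, here is a small sketch using the first two rows of the example array from the question and 3 copies per row. The names small and copies are only for illustration:

```python
import numpy as np

# First two rows of the example array from the question.
small = np.array([[11, 23, 24, 17, 0],
                  [1, 23, 12, 19, 5]])

# Each row is repeated 3 times, and copies of the same row stay adjacent.
copies = np.repeat(small, 3, axis=0)
print(copies.shape)  # (6, 5)
print(copies)
```

Rows 0 to 2 of copies are the first input row and rows 3 to 5 are the second, which is the same layout the loop in the question produces.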

Note that you do not need first_set_size at all: it is redundant with arr.shape[0]. You can do the same to the boolean mask so that the shapes match:

index_set = np.repeat(index_set, sec_set_size, axis=0)

Now you can update the part of my_array selected by index_set with the appropriate number of randomly generated values:

# np.random.randint excludes its upper limit (random.randint includes it), hence the + 1
my_array[index_set] += np.random.randint(1, upper_bound - my_array[index_set] + 1)
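
One thing worth checking is that index_set really has a boolean dtype. An integer array of 0s and 1s, as built in the question, would be treated as a list of row indices rather than as a mask. Below is a minimal sketch on part of the question's small example, assuming upper_bound = 50 and an inclusive upper limit as with random.randint. The names flags, mask and out are made up for illustration:

```python
import numpy as np

upper_bound = 50
arr = np.array([[11, 23, 24, 17, 0],
                [1, 23, 12, 19, 5]])

# Integer 0/1 flags, as in the question; convert them to a boolean mask first.
flags = np.array([[1, 0, 0, 0, 1],
                  [1, 0, 1, 0, 0]])
mask = flags.astype(bool)

out = arr.copy()
# np.random.randint excludes its upper limit, so "+ 1" lets values reach upper_bound.
out[mask] += np.random.randint(1, upper_bound - out[mask] + 1)

# Unflagged entries are unchanged; flagged ones grew by at least 1
# and never exceed upper_bound.
print(out)
```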

Your whole program reduces to about four lines (and runs very fast), plus some initialization:

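import time
import numpy as np

def foo(arr, upper_bound, index_set, sec_set_size, limit):
    # limit is unused here; it is kept only so the call signature stays familiar
    my_array = np.repeat(arr, sec_set_size, axis=0)         # copy every row sec_set_size times
    index_set = np.repeat(index_set, sec_set_size, axis=0)  # boolean mask with the same shape
    # + 1 because np.random.randint excludes its upper limit
    my_array[index_set] += np.random.randint(1, upper_bound - my_array[index_set] + 1)
    return my_array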

upper_bound = 50
limit = 1000
first_set_size= 100
sec_set_size = 50
arr = np.random.randint(25, size=(first_set_size, limit)) #create an array containing integer numbers
index_set = np.random.randint(2, size=(first_set_size, limit), dtype=bool)

start_time = time.time() #measure the time taken by the function
result = foo(arr, upper_bound, index_set, sec_set_size, limit)
print(f"time taken: {time.time() - start_time}")
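
As a variation, newer NumPy versions offer the Generator API, whose integers method accepts endpoint=True for an inclusive upper limit, matching random.randint. The sketch below assumes NumPy 1.17 or later. The function name foo_rng and its seed parameter are hypothetical, and this version has not been benchmarked against the one above:

```python
import numpy as np

def foo_rng(arr, upper_bound, index_set, sec_set_size, seed=None):
    """Variant of foo that draws increments from np.random.default_rng."""
    rng = np.random.default_rng(seed)
    my_array = np.repeat(arr, sec_set_size, axis=0)
    # Force a boolean mask so that 0/1 integer flags are also accepted.
    mask = np.repeat(np.asarray(index_set, dtype=bool), sec_set_size, axis=0)
    current = my_array[mask]
    # endpoint=True makes the upper limit inclusive, so values can reach upper_bound.
    my_array[mask] = current + rng.integers(1, upper_bound - current, endpoint=True)
    return my_array

arr = np.random.randint(25, size=(100, 1000))
index_set = np.random.randint(2, size=(100, 1000), dtype=bool)
result = foo_rng(arr, 50, index_set, 50, seed=0)
print(result.shape)  # (5000, 1000)
```

Passing a seed makes a run reproducible, which can help when comparing outputs between versions.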

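You might want to try using indices instead of a boolean mask. That makes the indexing itself more efficient, because the number of nonzero elements does not have to be computed twice, but on the other hand the setup is a bit more expensive: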

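def foo(arr, upper_bound, index_set, sec_set_size, limit):
    my_array = np.repeat(arr, sec_set_size, axis=0)
    r, c = np.where(index_set)  # coordinates of the flagged elements in arr
    # every flagged row r maps to rows sec_set_size*r ... sec_set_size*r + sec_set_size - 1
    r = (sec_set_size * r[:, None] + np.arange(sec_set_size)).ravel()
    c = np.repeat(c, sec_set_size)  # repeat each column index once per copy
    # + 1 because np.random.randint excludes its upper limit
    my_array[r, c] += np.random.randint(1, upper_bound - my_array[r, c] + 1)
    return my_array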
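
A quick way to convince yourself that the index arithmetic addresses the same elements as the repeated mask is to compare the two on a small input. This is only a sketch, using the first two rows of the question's index set and 3 copies per row. The names mask_positions and index_positions are made up for illustration:

```python
import numpy as np

sec_set_size = 3
index_set = np.array([[1, 0, 0, 0, 1],
                      [1, 0, 1, 0, 0]], dtype=bool)

# Positions selected by the repeated boolean mask.
mask = np.repeat(index_set, sec_set_size, axis=0)
mask_rows, mask_cols = np.where(mask)
mask_positions = set(zip(mask_rows.tolist(), mask_cols.tolist()))

# Positions produced by the index arithmetic.
r, c = np.where(index_set)
r = (sec_set_size * r[:, None] + np.arange(sec_set_size)).ravel()
c = np.repeat(c, sec_set_size)
index_positions = set(zip(r.tolist(), c.tolist()))

print(mask_positions == index_positions)  # True
```

Because every (row, column) pair appears exactly once, the in-place += on my_array[r, c] does not run into the usual problem of repeated indices being updated only once.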
