I want to map a function to each element of a vector in Theano. Can I do it without using scan?

Question

Say I have a function that counts the occurrences of ones up to each index of an array:

import theano
import theano.tensor as T


A = T.vector("A")
idx_range = T.arange(A.shape[0])

result, updates = theano.scan(fn=lambda idx: T.sum(A[:idx+1]), sequences=idx_range)

count_ones = theano.function(inputs=[A], outputs=result)

print(count_ones([0,0,1,0,0,1,1,1]))
# gives [ 0.  0.  1.  1.  1.  2.  3.  4.]

As noted elsewhere, using scan may be inefficient. Also, theano.scan always produces "RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility from scan_perform.scan_perform import *" on my machine.

So I wonder whether there is a better way to map functions in Theano?
Thanks in advance.

Edit:
I just realized this was a poor example. There is obviously a more efficient way that loops over the vector only once, like this:

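import numpy as np

# scan passes the sequence element first, then the previous output;
# the initial state uses A's dtype so it matches the step result
result, updates = theano.scan(fn=lambda a, prior_result: prior_result + a,
                              outputs_info=T.as_tensor_variable(np.asarray(0, dtype=A.dtype)),
                              sequences=A,
                              n_steps=A.shape[0])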

But according to @Daniel Renshaw's answer, because

the computation in one step is dependent on the same computation at some earlier step

I really cannot avoid using scan in this case, right?
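For reference, the dependency in this particular example is a running sum. The following is a minimal NumPy sketch, not Theano code, showing that this specific recurrence coincides with a cumulative sum, which array libraries usually provide as a single operation. The function name `running_sum_loop` is hypothetical. Theano appears to expose a comparable op as `theano.tensor.extra_ops.cumsum`; treat that name as an assumption to check against the installed version.

```python
import numpy as np

def running_sum_loop(values):
    """Step-by-step running sum: each output depends on the previous one."""
    total = 0.0
    out = []
    for v in values:
        total = total + v  # same recurrence as the scan step: prior_result + a
        out.append(total)
    return np.array(out)

a = np.array([0, 0, 1, 0, 0, 1, 1, 1], dtype=float)
loop_result = running_sum_loop(a)   # explicit loop with a carried state
builtin_result = np.cumsum(a)       # the same recurrence as one array operation

# both arrays hold the values 0, 0, 1, 1, 1, 2, 3, 4
print(loop_result)
print(builtin_result)
```

So a step that depends on the previous step does not always rule out a loop-free formulation; it depends on whether the recurrence matches an operation the library already provides.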

Edit:
I came up with a way of vectorizing it:

import numpy

A = T.vector()
in_size = 8
# a matrix with ones at and below the main diagonal and zeros elsewhere
mask = theano.shared(numpy.tri(in_size, dtype=theano.config.floatX))
result = T.dot(mask, A)
count_ones = theano.function(inputs=[A], outputs=result)
print(count_ones(numpy.asarray([0,0,1,0,0,1,1,1], dtype=theano.config.floatX)))

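But in this case I have to know the size of the input in advance (unless I can build something like numpy.tri dynamically as a matrix?).
Any comments are welcome. :)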
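One way to drop the fixed size is to derive the triangular mask from the length of the input itself. Below is a minimal NumPy sketch of that idea, not Theano code: the mask is built by comparing an index range against itself with broadcasting, which mirrors the `T.arange(A.shape[0])` already used in the first snippet. The function name `prefix_sums_with_mask` is hypothetical, and whether the symbolic version is efficient in Theano is not verified here.

```python
import numpy as np

def prefix_sums_with_mask(values):
    """Prefix sums via a lower-triangular mask sized from the input."""
    values = np.asarray(values, dtype=float)
    n = values.shape[0]                 # size is read from the input at call time
    idx = np.arange(n)
    # mask[i, j] is 1 where j <= i, i.e. ones on and below the main diagonal
    mask = (idx[:, None] >= idx[None, :]).astype(values.dtype)
    return mask.dot(values)             # row i sums values[0..i]

short = prefix_sums_with_mask([0, 0, 1, 0, 0, 1, 1, 1])
longer = prefix_sums_with_mask([1] * 10)  # a different length, no code change
print(short)
print(longer)
```

Note that the mask needs memory quadratic in the input length, so this trades memory for the absence of a loop.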

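Edit:
I benchmarked the three approaches with a 512-dimensional input array and 10000 iterations, and got the following results:

Mapping the sum function over each element: CPU 16s, GPU 140s
Looping over the array with scan: CPU 13s, GPU 32s
Vectorized: CPU 0.8s, GPU 0.8s (actually, I don't think Theano used the GPU for this one)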

Answer 1

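In the most general case, if no assumptions are made about the function, scan has to be used. However, many (perhaps most?) useful functions can be vectorized so that scan is not needed. As pointed out in the edits to the question, the example function can certainly be applied to the input without using scan.

Whether scan is needed depends on the function that has to be applied. One case that definitely requires scan is when the computation in one step depends on the same computation at some earlier step.

P.S. The warning about binary incompatibility can be safely ignored.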
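The two situations can be contrasted in a small NumPy sketch (not Theano code). The first function is elementwise, so each output depends only on the matching input and no loop is needed. The second is a recurrence in which each step needs the previous result, so it is written as an explicit loop, the role scan plays in Theano. Both function names and the chosen recurrence are hypothetical illustrations.

```python
import numpy as np

def square_plus_one(x):
    # Elementwise: output i depends only on input i, so the whole
    # array is processed in one vectorized expression.
    return x ** 2 + 1.0

def logistic_steps(x0, r, n_steps):
    # Recurrence: x[t] = r * x[t-1] * (1 - x[t-1]).
    # Each step needs the value produced by the previous step.
    xs = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        x = r * x * (1.0 - x)
        xs[t] = x
    return xs

mapped = square_plus_one(np.array([0.0, 1.0, 2.0, 3.0]))  # values 1, 2, 5, 10
trajectory = logistic_steps(0.5, 3.2, 3)
print(mapped)
print(trajectory)
```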


parallel-processing

numpy

vectorization

theano