Counter on GPU insanely slow compared to default counter?
EDIT -- see the edit at the bottom: TensorFlow on the GPU turns out to be insanely fast at incrementing a large vector of counters.
I wanted to see whether using the GPU gives me any speed advantage. The program below simply counts to 200,000, once using TensorFlow on the GPU and once using plain ol' Python. The TensorFlow loop takes over 14 seconds to run, while plain ol' Python takes only .013 seconds? What am I doing wrong? Here is the code:
#!/usr/bin/env python
import tensorflow as tf
import sys, time
# Create a Variable, that will be initialized to the scalar value 0.
state = tf.Variable(0, name="counter")
MAX=10000
# Create an Op to add one to `state`.
one = tf.constant(1)
new_value = tf.add(state, one)
update = tf.assign(state, new_value)
# Variables must be initialized by running an `init` Op after having
# launched the graph. We first have to add the `init` Op to the graph.
init_op = tf.initialize_all_variables()
if __name__ == '__main__':
    # Launch the graph and run the ops.
    with tf.Session() as sess:
        # Run the 'init' op
        sess.run(init_op)
        # Print the initial value of 'state'
        print sess.run(state)
        # Run the op that updates 'state' and print 'state'.
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            sess.run(update)
        print str(sess.run(state)) + str(time.time() - t0)

        count = 0
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            count += 1
        print str(count) + str(time.time() - t0)
Outputs this:
$ ./helloworld.py 200000
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:888] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties:
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.3165
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.69GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 3649540096
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
0
starting ...
20000014.444382906
starting ...
2000000.0131969451904
EDIT -- after the suggested change to a vector of counters, TensorFlow on the GPU is insanely fast.
With 10,000 counters per vector:
#!/usr/bin/env python
import tensorflow as tf
import sys, time
CSIZE=10000
# Create a Variable, that will be initialized to the scalar value 0.
state = tf.Variable([0 for x in range(CSIZE)], name="counter")
MAX=1000
# Create an Op to add one to `state`.
one = tf.constant([1 for x in range(CSIZE)])
new_value = tf.add(state, one)
update = tf.assign(state, new_value)
# Variables must be initialized by running an `init` Op after having
# launched the graph. We first have to add the `init` Op to the graph.
init_op = tf.initialize_all_variables()
if __name__ == '__main__':
    # Launch the graph and run the ops.
    with tf.Session() as sess:
        # Run the 'init' op
        sess.run(init_op)
        # Print the initial value of 'state'
        print sess.run(state)
        # Run the op that updates 'state' and print 'state'.
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            sess.run(update)
        print str(sess.run(state)) + str(time.time() - t0)

        counters = [0 for x in range(CSIZE)]
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            for x in range(0, len(counters)):
                counters[x] += 1
        print str(counters[0]) + ", " + str(time.time() - t0)
Output:
$ ./helloworld.py 127 ↵
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:888] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties:
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.3165
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.69GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 3645083648
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
[0 0 0 ..., 0 0 0]
starting ...
[10000 10000 10000 ..., 10000 10000 10000]0.997926950455
starting ...
10000, 9.66100215912
With 100,000 counters, the output is:
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 3653734400
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
[0 0 0 ..., 0 0 0]
starting ...
[10000 10000 10000 ..., 10000 10000 10000]1.57860684395
starting ...
^CTraceback (most recent call last):
File "./helloworld.py", line 40, in <module>
for x in range(0, len(counters)) :
KeyboardInterrupt
Plain ol' Python ran for over a minute until I gave up.
In one sense, both of these programs are "surprisingly" slow relative to the number of instructions they have to execute. The single-element counter performs 200,000 increments in 14.4 seconds, using 200,000 calls to sess.run(). The vector counter performs 100,000,000 increments in 0.99 seconds, using 10,000 calls to sess.run(). If you wrote these programs in C, you would find that each counter increment takes at most a few nanoseconds, so where is the time going?
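Dividing the measured times by the number of calls gives a rough per-call figure: 14.4 s / 200,000 ≈ 72 µs per sess.run() call for the scalar counter, and 0.99 s / 10,000 ≈ 99 µs per call for the vector counter. In other words, the cost is dominated by a roughly constant per-call overhead, not by the increments themselves.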
TensorFlow imposes some per-step overhead, on the order of a few microseconds for each call to Session.run(). This is a known issue that the team is trying to reduce, but it is rarely a problem for most neural-network algorithms, which typically run a large amount of work in a single step. The overhead can be broken down as follows:
- Per-step dispatch overhead: the TensorFlow Session API is string-based, so some string manipulation and hashing must be done to identify the correct subgraph to run on each step. This involves some Python and some C++ code.
- Per-op dispatch overhead: this is implemented in C++ and involves setting up a context and dispatching the TensorFlow kernel. There are three ops in your counter benchmark (VariableOp, Add, and Assign).
- GPU kernel dispatch overhead: dispatching a kernel to the GPU involves a call into the GPU driver.
- GPU copy overhead: perhaps surprisingly, sess.run(update) will copy the result back from the GPU, because update is a Tensor object (corresponding to the result of the assignment) and its value is returned by the call.
There are a couple of things you can try to speed up both versions of the code. Using state.assign_add(one) instead of the separate tf.add and tf.assign ops will reduce the per-op dispatch overhead (and it also does a more efficient in-place addition). Calling sess.run(update.op) will avoid copying the result back to the client on each step.
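For example, here is a minimal sketch of both suggestions applied to the scalar counter from the question (same TensorFlow 0.x-era API and Python 2 print statements as your code; only the construction of update and the run call change):

#!/usr/bin/env python
import tensorflow as tf
import sys, time

state = tf.Variable(0, name="counter")
one = tf.constant(1)

# Fused, in-place increment instead of separate tf.add and tf.assign ops.
update = state.assign_add(one)

init_op = tf.initialize_all_variables()
MAX = 10000

if __name__ == '__main__':
    with tf.Session() as sess:
        sess.run(init_op)
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            # Run the Op rather than the Tensor, so the updated value is not
            # copied back from the GPU to the client on every step.
            sess.run(update.op)
        print str(sess.run(state)) + " " + str(time.time() - t0)

The per-step and per-op dispatch overheads are still paid on every iteration, so this will not reach C-like speeds for a scalar counter, but it removes one op from the graph and the copy back to the client.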