Counter on GPU insanely slow compared to default counter?
EDIT -- see the edit at the bottom: TensorFlow on the GPU turns out to be insanely fast at incrementing a large vector of counters.
I wanted to see whether using the GPU gives me any speed advantage. The program below simply counts to 200,000, once using TensorFlow on the GPU and once using plain ol' Python. The TensorFlow loop takes over 14 seconds to run, while plain ol' Python takes only .013 seconds? What am I doing wrong? Here is the code:
#!/usr/bin/env python
import tensorflow as tf
import sys, time
# Create a Variable, that will be initialized to the scalar value 0.
state = tf.Variable(0, name="counter")
MAX=10000
# Create an Op to add one to `state`.
one = tf.constant(1)
new_value = tf.add(state, one)
update = tf.assign(state, new_value)
# Variables must be initialized by running an `init` Op after having
# launched the graph. We first have to add the `init` Op to the graph.
init_op = tf.initialize_all_variables()
if __name__ == '__main__':
    # Launch the graph and run the ops.
    with tf.Session() as sess:
        # Run the 'init' op
        sess.run(init_op)
        # Print the initial value of 'state'
        print sess.run(state)
        # Run the op that updates 'state' and print 'state'.
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            sess.run(update)
        print str(sess.run(state)) + str(time.time() - t0)

        count = 0
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            count += 1
        print str(count) + str(time.time() - t0)
Outputs this:
$ ./helloworld.py 200000
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:888] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties:
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.3165
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.69GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 3649540096
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
0
starting ...
20000014.444382906
starting ...
2000000.0131969451904
EDIT -- after the suggested change to a vector of counters, TensorFlow on the GPU is insanely fast.
With 10,000 counters per vector:
#!/usr/bin/env python
import tensorflow as tf
import sys, time
CSIZE=10000
# Create a Variable, that will be initialized to the scalar value 0.
state = tf.Variable([0 for x in range(CSIZE)], name="counter")
MAX=1000
# Create an Op to add one to `state`.
one = tf.constant([1 for x in range(CSIZE)])
new_value = tf.add(state, one)
update = tf.assign(state, new_value)
# Variables must be initialized by running an `init` Op after having
# launched the graph. We first have to add the `init` Op to the graph.
init_op = tf.initialize_all_variables()
if __name__ == '__main__':
    # Launch the graph and run the ops.
    with tf.Session() as sess:
        # Run the 'init' op
        sess.run(init_op)
        # Print the initial value of 'state'
        print sess.run(state)
        # Run the op that updates 'state' and print 'state'.
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            sess.run(update)
        print str(sess.run(state)) + str(time.time() - t0)

        counters = [0 for x in range(CSIZE)]
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            for x in range(0, len(counters)):
                counters[x] += 1
        print str(counters[0]) + ", " + str(time.time() - t0)
Output:
$ ./helloworld.py 127 ↵
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:888] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties:
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.3165
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.69GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 3645083648
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
[0 0 0 ..., 0 0 0]
starting ...
[10000 10000 10000 ..., 10000 10000 10000]0.997926950455
starting ...
10000, 9.66100215912
With 100,000 counters, the output is:
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 3653734400
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
[0 0 0 ..., 0 0 0]
starting ...
[10000 10000 10000 ..., 10000 10000 10000]1.57860684395
starting ...
^CTraceback (most recent call last):
File "./helloworld.py", line 40, in <module>
for x in range(0, len(counters)) :
KeyboardInterrupt
Plain ol' Python ran for over a minute until I gave up.
In one sense, both of these programs are "surprisingly" slow relative to the number of instructions they have to execute. The single-element counter performs 200,000 increments in 14.4 seconds, using 200,000 calls to sess.run(). The vector counter performs 100,000,000 increments in 0.99 seconds, using 10,000 calls to sess.run(). If you wrote these programs in C, you would find that each counter increment takes at most a few nanoseconds, so where is the time going?
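Dividing the measured times by the number of calls gives a rough per-call figure: 14.4 s / 200,000 ≈ 72 µs per sess.run() call for the scalar counter, and 0.99 s / 10,000 ≈ 99 µs per call for the vector counter. In other words, the cost is dominated by a roughly constant per-call overhead, not by the increments themselves.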
TensorFlow imposes some per-step overhead, on the order of a few microseconds for each call to Session.run(). This is a known issue that the team is trying to reduce, but it is rarely a problem for most neural-network algorithms, which typically run a large amount of work in a single step. The overhead can be broken down as follows:
- Per-step dispatch overhead: the TensorFlow Session API is string-based, so some string manipulation and hashing must be done to identify the correct subgraph to run on each step. This involves some Python and some C++ code.
- Per-op dispatch overhead: this is implemented in C++ and involves setting up a context and dispatching the TensorFlow kernel. There are three ops in your counter benchmark (VariableOp, Add, and Assign).
- GPU kernel dispatch overhead: dispatching a kernel to the GPU involves a call into the GPU driver.
- GPU copy overhead: perhaps surprisingly, sess.run(update) will copy the result back from the GPU, because update is a Tensor object (corresponding to the result of the assignment) and its value is returned by the call.
There are a couple of things you can try to speed up both versions of the code. Using state.assign_add(one) instead of the separate tf.add and tf.assign ops will reduce the per-op dispatch overhead (and it also does a more efficient in-place addition). Calling sess.run(update.op) will avoid copying the result back to the client on each step.
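For example, here is a minimal sketch of both suggestions applied to the scalar counter from the question (same TensorFlow 0.x-era API and Python 2 print statements as your code; only the construction of update and the run call change):

#!/usr/bin/env python
import tensorflow as tf
import sys, time

state = tf.Variable(0, name="counter")
one = tf.constant(1)

# Fused, in-place increment instead of separate tf.add and tf.assign ops.
update = state.assign_add(one)

init_op = tf.initialize_all_variables()
MAX = 10000

if __name__ == '__main__':
    with tf.Session() as sess:
        sess.run(init_op)
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            # Run the Op rather than the Tensor, so the updated value is not
            # copied back from the GPU to the client on every step.
            sess.run(update.op)
        print str(sess.run(state)) + " " + str(time.time() - t0)

The per-step and per-op dispatch overheads are still paid on every iteration, so this will not reach C-like speeds for a scalar counter, but it removes one op from the graph and the copy back to the client.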