TensorFlow 似乎不使用 GPU

Question

我在 Windows 8 和 Python 3.5 上使用 TensorFlow。我更改了 this 简短示例以查看 GPU 支持（Titan X) 是否有效。不幸的是，运行时使用（tf.device("/gpu:0"）和不使用（tf.device("/cpu:0"））使用 GPU相同的。 Windows CPU 监控显示，在这两种情况下，计算期间 CPU 负载大约为 100%。

这是代码示例：

import numpy as np
import tensorflow as tf
import datetime

#num of multiplications to perform
n = 100

# Create random large matrix
matrix_size = 1e3
A = np.random.rand(matrix_size, matrix_size).astype('float32')
B = np.random.rand(matrix_size, matrix_size).astype('float32')

# Creates a graph to store results
c1 = []

# Define matrix power
def matpow(M, n):
    if n < 1: #Abstract cases where n < 1
        return M
    else:
        return tf.matmul(M, matpow(M, n-1))

with tf.device("/gpu:0"):
    a = tf.constant(A)
    b = tf.constant(B)
    #compute A^n and B^n and store results in c1
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

    sum = tf.add_n(c1)

t1 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    # Runs the op.
    sess.run(sum)
t2 = datetime.datetime.now()

print("computation time: " + str(t2-t1))

这里是 GPU 案例的输出：

I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:128] successfully opened CUDA library cublas64_80.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:128] successfully opened CUDA library cudnn64_5.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:128] successfully opened CUDA library cufft64_80.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:128] successfully opened CUDA library nvcuda.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:128] successfully opened CUDA library curand64_80.dll locally
C:/Users/schlichting/.spyder-py3/temp.py:16: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  A = np.random.rand(matrix_size, matrix_size).astype('float32')
C:/Users/schlichting/.spyder-py3/temp.py:17: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  B = np.random.rand(matrix_size, matrix_size).astype('float32')
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:01:00.0
Total memory: 12.00GiB
Free memory: 2.40GiB
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:906] DMA: 0 
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:916] 0:   Y 
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0)
D c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\direct_session.cc:255] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0

Ievice mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0

C:0/task:0/gpu:0
host/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_108: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_109: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_110: (MatMul)/job:localhost/replicacalhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_107: (MatMul)/job:localgpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_103: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_104: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_105: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_106: (MatMul)/job:lo c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] Const_1: (Const)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_100: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_101: (MatMul)/job:localhost/replica:0/task:0/gpu:0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\simple_placer.cc:827] MatMul_102: (MatMul)/job:localhost/replica:0/task:0/Ionst_1: (Const): /job:localhost/replica:0/task:0/gpu:0


MatMul_100: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_101: (MatMul): /job:localhost/replica:0/task:0/gpu:0
...
MatMul_198: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_199: (MatMul): /job:localhost/replica:0/task:0/gpu:0
Const: (Const): /job:localhost/replica:0/task:0/gpu:0
MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_1: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_2: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_3: (MatMul): /job:localhost/replica:0/task:0/gpu:0
...
MatMul_98: (MatMul): /job:localhost/replica:0/task:0/gpu:0
MatMul_99: (MatMul): /job:localhost/replica:0/task:0/gpu:0
AddN: (AddN): /job:localhost/replica:0/task:0/gpu:0
computation time: 0:00:05.066000

在 CPU 的情况下，输出是相同的，cpu:0 而不是 gpu:0。计算时间不变。即使我使用更多操作，例如运行时间约为 1 分钟，GPU 和 CPU 相等。非常感谢！

Answer 1

根据日志信息，特别是设备布局，您的代码使用 GPU。只是运行的时间是一样的。我的猜测是：

c1.append(matpow(a, n))
c1.append(matpow(b, n))

是您代码中的瓶颈，不断地将大矩阵从 GPU 内存移动到 RAM。你可以尝试：

将矩阵大小更改为1e4 x 1e4

with tf.device("/gpu:0"):
  A = tf.random_normal([matrix_size, matrix_size])
  B = tf.random_normal([matrix_size, matrix_size])
  C = tf.matmul(A, B)
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
  t1 = datetime.datetime.now()
  sess.run(C)
  t2 = datetime.datetime.now()

Answer 2

举例来说，创建 tensorflow 会话需要 4.9 秒，而实际计算在 cpu 上只需要 0.1 秒，因此 cpu 上的时间为 5.0 秒。现在说在 gpu 上创建会话也需要 4.9 秒，但计算需要 0.01 秒，给出 4.91 秒的时间。你几乎看不出区别。创建会话是程序启动时的一次性开销成本。你不应该把它包括在你的时间里。当你第一次调用 sess.run 时，tensorflow 也会做一些 compilation/optimization，这使得第一个运行更慢。

试试这样计时。

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    # Runs the op the first time.
    sess.run(sum)
    t1 = datetime.datetime.now()
    for i in range(1000):
        sess.run(sum)
    t2 = datetime.datetime.now()

如果这不能解决问题，也可能是您的计算没有足够的并行性让 GPU 真正击败 cpu。增加矩阵大小可能会带来差异。

TensorFlow 似乎不使用 GPU

TensorFlow seems not to use GPU

python

gpu

python-3.5

tensorflow

tensorflow-gpu