Tensorflow code not using GPU

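I have an Nvidia GTX 1080 running on Ubuntu 14.04. I am trying to implement a convolutional autoencoder with tensorflow 1.0.1, but the program does not seem to use the GPU at all. I verified this with watch nvidia-smi and htop. The output after running the program is as follows: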

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
getting into solving the reconstruction loss
Dimension of z i.e. our latent vector is [None, 100]
Dimension of the output of the decoder is [100, 28, 28, 1]
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:0a:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x34bccc0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:09:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x34c0940
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 2 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:06:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x34c45c0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 3 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:05:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080, pci bus id: 0000:06:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080, pci bus id: 0000:05:00.0)

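Could there be a problem with my code? I have also tried pinning it to a specific device by wrapping the graph construction in with tf.device("/gpu:0"):. Let me know if any further information is needed.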

Edit 1: output of nvidia-smi

exx@ubuntu:~$ nvidia-smi
Wed Apr 19 20:50:07 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:05:00.0     Off |                  N/A |
| 38%   54C    P8    12W / 180W |   7715MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:06:00.0     Off |                  N/A |
| 38%   55C    P8     8W / 180W |   7715MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 0000:09:00.0     Off |                  N/A |
| 36%   50C    P8     8W / 180W |   7715MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 0000:0A:00.0     Off |                  N/A |
| 35%   54C    P2    41W / 180W |   7833MiB /  8113MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     24228    C   python3                                       7713MiB |
|    1     24228    C   python3                                       7713MiB |
|    2     24228    C   python3                                       7713MiB |
|    3     24228    C   python3                                       7831MiB |
+-----------------------------------------------------------------------------+

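htop shows that it is using about 100% of one CPU core. My basis for saying it is not using the GPU is the GPU utilization percentage. This time it shows 8%, but it is usually 0%.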

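So your run is using the GPU, and from that point of view everything is configured correctly, but the speed looks really poor. Make sure you run nvidia-smi several times to get a feel for how it is doing; it may show 100% at one moment and 8% at another.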
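To make the repeated sampling less manual, here is a minimal sketch that polls nvidia-smi a few times and reports the min, mean, and max utilization per GPU. It assumes nvidia-smi is on your PATH and that your driver supports the --query-gpu=utilization.gpu query; the function names are mine, not part of any library.

```python
import subprocess
import time

# Assumption: this query is supported by the installed driver/nvidia-smi.
QUERY_CMD = ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"]


def parse_utilization(csv_text):
    """Turn nvidia-smi output (one percentage per line) into a list of ints.

    Note: int() will raise if a GPU reports a non-numeric value such as
    "[Not Supported]".
    """
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def sample_gpu_utilization(samples=10, interval=0.5, run=subprocess.check_output):
    """Poll utilization `samples` times, sleeping `interval` seconds between polls.

    `run` is injectable so the function can be exercised without a GPU.
    """
    readings = []
    for _ in range(samples):
        readings.append(parse_utilization(run(QUERY_CMD).decode()))
        time.sleep(interval)
    return readings


def summarize(readings):
    """Return one (min, mean, max) tuple per GPU across all samples."""
    per_gpu = zip(*readings)  # transpose: samples x gpus -> gpus x samples
    return [(min(g), sum(g) / len(g), max(g)) for g in per_gpu]
```

Calling summarize(sample_gpu_utilization()) while training is running should show whether that 8% is a typical reading or just a low moment between bursts.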

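Getting about 80% utilization from the GPU is normal, because time is lost loading each batch from main memory to the GPU before each run (new functionality to improve that, GPU queues in TF, is coming soon).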

If you are getting less than ~80% utilization out of your GPUs, you are doing something wrong. Two likely and common causes come to mind:

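1) You are doing a bunch of preprocessing between steps, so the GPU runs quickly but you are then blocked on a single CPU thread doing a bunch of non-tensorflow work. Move that work to its own thread and load the data to the GPU from a python Queue.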

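2) A large amount of data is moving back and forth between CPU and GPU memory. If you do this, the bandwidth between the CPU and the GPU can become the bottleneck.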
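For cause 1, a rough sketch of the "own thread plus python Queue" idea looks like the following. This is plain Python, not a TensorFlow API: prefetch, preprocess, and max_prefetch are hypothetical names, and I am assuming your preprocessing is a function you can call once per batch.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def prefetch(batches, preprocess, max_prefetch=4):
    """Yield preprocess(batch) for each batch, computed ahead in a worker thread.

    The bounded queue keeps at most `max_prefetch` prepared batches in memory,
    so the worker cannot run arbitrarily far ahead of the training loop.
    """
    q = queue.Queue(maxsize=max_prefetch)

    def worker():
        try:
            for batch in batches:
                q.put(preprocess(batch))  # blocks while the queue is full
        finally:
            # Always signal completion. Caveat of this sketch: if preprocess
            # raises, the consumer simply sees the stream end early.
            q.put(_SENTINEL)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item


# Hypothetical usage inside a training loop (sess, train_op, x are your own):
# for images in prefetch(raw_batches, my_preprocess):
#     sess.run(train_op, feed_dict={x: images})
```

The point is that the next batch is being prepared while the current step runs, rather than the two alternating on one thread. How much this buys you depends on how much of your preprocessing can actually overlap with the training step, so measure before and after.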

Try adding some timers between the start and end of each training/inference batch, and see whether you are spending a lot of time outside of tensorflow operations.
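Something along these lines would do for the timers. It is only a sketch: timed and profile_loop are names I made up, and prepare and step stand in for your own preprocessing and your sess.run call.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


@contextmanager
def timed(label, totals):
    """Add the wall-clock time spent inside the with-block to totals[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[label] += time.perf_counter() - start


def profile_loop(batches, prepare, step):
    """Run the loop once and return total seconds spent in each phase.

    prepare(batch) is the Python-side work (loading, preprocessing);
    step(feed) is the tensorflow call, e.g. a wrapper around sess.run.
    """
    totals = defaultdict(float)
    for batch in batches:
        with timed("prepare", totals):
            feed = prepare(batch)
        with timed("step", totals):
            step(feed)
    return dict(totals)
```

If "prepare" dominates the totals, the GPU is mostly waiting on Python and cause 1 is the likely culprit. If "step" dominates while utilization is still low, look at how much data each step feeds in and fetches back.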