Running Tensorflow on CPU is faster than running it on GPU
I have an ASUS n552vw laptop with a dedicated 4GB GeForce GTX 960M graphics card. I put these lines at the top of my code to compare training speed on the GPU versus the CPU, and it looks like the CPU wins!
For the GPU:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
For the CPU:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
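To confirm which devices TensorFlow actually sees after setting this variable, a minimal check looks like this (a sketch using the TF 1.x device_lib helper; not part of the original script):
# Sketch (not in the original script): list the devices this TensorFlow 1.x
# build can see after CUDA_VISIBLE_DEVICES has been set.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # or '-1' to hide the GPU

from tensorflow.python.client import device_lib

# Prints CPU:0 and, if visible, GPU:0 with its memory and compute capability.
print(device_lib.list_local_devices())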
I installed CUDA, cuDNN, tensorflow-gpu and so on precisely to speed up my training, but the opposite seems to be happening!
When I run the first snippet, it prints (before execution starts):
Train on 2128 samples, validate on 22 samples
Epoch 1/1
2019-08-02 18:49:41.828287: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-08-02 18:49:42.457662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:01:00.0
totalMemory: 4.00GiB freeMemory: 3.34GiB
2019-08-02 18:49:42.458819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-08-02 18:49:43.776498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-02 18:49:43.777007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-08-02 18:49:43.777385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-08-02 18:49:43.777855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3050 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
2019-08-02 18:49:51.834610: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
and it is really slow [Finished in 263.2s], but when I run the second snippet it prints:
Train on 2128 samples, validate on 22 samples
Epoch 1/1
2019-08-02 18:51:43.021867: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-08-02 18:51:43.641123: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-08-02 18:51:43.645072: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA diagnostic information for host: DESKTOP-UQ8B9FK
2019-08-02 18:51:43.645818: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: DESKTOP-UQ8B9FK
and it is much faster than the first one [Finished in 104.7s]! How is that possible??
Edit: Here is the Tensorflow-related part of the code:
# Keras imports needed for this excerpt; un, dp, rp, ds are hyperparameters defined elsewhere
from keras.models import Sequential
from keras.layers import LSTM, Dropout, RepeatVector, TimeDistributed, Dense

model = Sequential()
model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))
model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))
model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))
model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))
model.add(LSTM(un, return_sequences=False))
model.add(Dropout(dp))
model.add(RepeatVector(rp))
model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))
model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))
model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))
model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))
model.add(LSTM(un, return_sequences=True))
model.add(Dropout(dp))
model.add(TimeDistributed(Dense(ds)))
There are two related issues here:
- The model needs to be "big enough" to benefit from GPU acceleration: the training data has to be shipped to the GPU and the updated weights copied back, and this transfer overhead eats into the speedup and can make things slower overall.
- Recurrent layers are hard to parallelize because they involve many sequential computations across time steps. You might consider using the CuDNNLSTM layer instead of the plain LSTM, since it is optimized for GPU use (see the sketch below).
In general, for small models, training on a GPU may not be faster than training on a CPU.
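A minimal sketch of the suggested swap, assuming Keras 2.x on a TensorFlow 1.x GPU backend; the hyperparameter names mirror the question, and the values and input_shape are placeholders:
# Hedged sketch: CuDNNLSTM routes the recurrent math through cuDNN's fused
# kernel instead of step-by-step ops, which usually helps GPU utilization.
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dropout, TimeDistributed, Dense

un, dp, ds = 64, 0.2, 1          # placeholder hyperparameters, as in the question
timesteps, features = 30, 8      # placeholder input shape

model = Sequential()
model.add(CuDNNLSTM(un, return_sequences=True, input_shape=(timesteps, features)))
model.add(Dropout(dp))
model.add(CuDNNLSTM(un, return_sequences=True))
model.add(Dropout(dp))
model.add(TimeDistributed(Dense(ds)))
model.compile(optimizer='adam', loss='mse')
Note that CuDNNLSTM only runs on a GPU, so keep the plain LSTM version around for CPU runs.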