Why is TensorFlow GPU not working with larger batch sizes?
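I am training an autoencoder network on TensorFlow GPU 1.13.1. Initially I used batch sizes of 32/64/128, but the GPU does not seem to be used at all, even though the "Memory-Usage" column from nvidia-smi reports the following: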
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   34C    P0    53W / 300W |  31316MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Also, training stops at step 39 every time.
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) (None, 256, 256, 3) 0
_________________________________________________________________
conv2d_6 (Conv2D) (None, 64, 64, 96) 34944
_________________________________________________________________
batch_normalization_6 (Batch (None, 64, 64, 96) 384
_________________________________________________________________
activation_6 (Activation) (None, 64, 64, 96) 0
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 31, 31, 96) 0
_________________________________________________________________
conv2d_7 (Conv2D) (None, 31, 31, 256) 614656
_________________________________________________________________
batch_normalization_7 (Batch (None, 31, 31, 256) 1024
_________________________________________________________________
activation_7 (Activation) (None, 31, 31, 256) 0
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 15, 15, 256) 0
_________________________________________________________________
conv2d_8 (Conv2D) (None, 15, 15, 384) 885120
_________________________________________________________________
batch_normalization_8 (Batch (None, 15, 15, 384) 1536
_________________________________________________________________
activation_8 (Activation) (None, 15, 15, 384) 0
_________________________________________________________________
conv2d_9 (Conv2D) (None, 15, 15, 384) 1327488
_________________________________________________________________
batch_normalization_9 (Batch (None, 15, 15, 384) 1536
_________________________________________________________________
activation_9 (Activation) (None, 15, 15, 384) 0
_________________________________________________________________
conv2d_10 (Conv2D) (None, 15, 15, 256) 884992
_________________________________________________________________
batch_normalization_10 (Batc (None, 15, 15, 256) 1024
_________________________________________________________________
activation_10 (Activation) (None, 15, 15, 256) 0
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 7, 7, 256) 0
_________________________________________________________________
conv2d_11 (Conv2D) (None, 1, 1, 1024) 12846080
_________________________________________________________________
batch_normalization_11 (Batc (None, 1, 1, 1024) 4096
_________________________________________________________________
encoded (Activation) (None, 1, 1, 1024) 0
_________________________________________________________________
reshape_1 (Reshape) (None, 2, 2, 256) 0
_________________________________________________________________
conv2d_transpose_1 (Conv2DTr (None, 4, 4, 128) 819328
_________________________________________________________________
activation_11 (Activation) (None, 4, 4, 128) 0
_________________________________________________________________
conv2d_transpose_2 (Conv2DTr (None, 8, 8, 64) 204864
_________________________________________________________________
activation_12 (Activation) (None, 8, 8, 64) 0
_________________________________________________________________
conv2d_transpose_3 (Conv2DTr (None, 16, 16, 32) 51232
_________________________________________________________________
activation_13 (Activation) (None, 16, 16, 32) 0
_________________________________________________________________
conv2d_transpose_4 (Conv2DTr (None, 32, 32, 16) 12816
_________________________________________________________________
activation_14 (Activation) (None, 32, 32, 16) 0
_________________________________________________________________
conv2d_transpose_5 (Conv2DTr (None, 64, 64, 8) 3208
_________________________________________________________________
activation_15 (Activation) (None, 64, 64, 8) 0
_________________________________________________________________
conv2d_transpose_6 (Conv2DTr (None, 128, 128, 4) 804
_________________________________________________________________
activation_16 (Activation) (None, 128, 128, 4) 0
_________________________________________________________________
conv2d_transpose_7 (Conv2DTr (None, 256, 256, 3) 303
=================================================================
Total params: 17,695,435
Trainable params: 17,690,635
Non-trainable params: 4,800
_________________________________________________________________
Epoch 1/1
Found 11058 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
1/1382 [..............................] - ETA: 19:43:47 - loss: 0.6934 - accuracy: 0.1511
2/1382 [..............................] - ETA: 10:04:16 - loss: 0.6933 - accuracy: 0.1545
3/1382 [..............................] - ETA: 7:28:21 - loss: 0.6933 - accuracy: 0.1571
4/1382 [..............................] - ETA: 6:07:30 - loss: 0.6932 - accuracy: 0.1590
5/1382 [..............................] - ETA: 5:21:58 - loss: 0.6931 - accuracy: 0.1614
6/1382 [..............................] - ETA: 4:55:45 - loss: 0.6930 - accuracy: 0.1648
7/1382 [..............................] - ETA: 4:32:58 - loss: 0.6929 - accuracy: 0.1668
8/1382 [..............................] - ETA: 4:15:07 - loss: 0.6929 - accuracy: 0.1692
9/1382 [..............................] - ETA: 4:02:22 - loss: 0.6928 - accuracy: 0.1726
10/1382 [..............................] - ETA: 3:50:11 - loss: 0.6926 - accuracy: 0.1745
11/1382 [..............................] - ETA: 3:39:13 - loss: 0.6925 - accuracy: 0.1769
12/1382 [..............................] - ETA: 3:29:38 - loss: 0.6924 - accuracy: 0.1797
13/1382 [..............................] - ETA: 3:21:11 - loss: 0.6923 - accuracy: 0.1824
14/1382 [..............................] - ETA: 3:13:42 - loss: 0.6922 - accuracy: 0.1845
15/1382 [..............................] - ETA: 3:07:17 - loss: 0.6920 - accuracy: 0.1871
16/1382 [..............................] - ETA: 3:01:59 - loss: 0.6919 - accuracy: 0.1896
17/1382 [..............................] - ETA: 2:57:36 - loss: 0.6918 - accuracy: 0.1916
18/1382 [..............................] - ETA: 2:53:06 - loss: 0.6917 - accuracy: 0.1938
19/1382 [..............................] - ETA: 2:49:37 - loss: 0.6915 - accuracy: 0.1956
20/1382 [..............................] - ETA: 2:45:51 - loss: 0.6915 - accuracy: 0.1979
21/1382 [..............................] - ETA: 2:43:18 - loss: 0.6914 - accuracy: 0.2000
22/1382 [..............................] - ETA: 2:41:02 - loss: 0.6913 - accuracy: 0.2022
23/1382 [..............................] - ETA: 2:39:23 - loss: 0.6912 - accuracy: 0.2039
24/1382 [..............................] - ETA: 2:37:23 - loss: 0.6911 - accuracy: 0.2060
25/1382 [..............................] - ETA: 2:35:58 - loss: 0.6909 - accuracy: 0.2080
26/1382 [..............................] - ETA: 2:34:06 - loss: 0.6909 - accuracy: 0.2098
27/1382 [..............................] - ETA: 2:33:19 - loss: 0.6908 - accuracy: 0.2115
28/1382 [..............................] - ETA: 2:32:24 - loss: 0.6906 - accuracy: 0.2130
29/1382 [..............................] - ETA: 2:31:43 - loss: 0.6904 - accuracy: 0.2143
30/1382 [..............................] - ETA: 2:31:09 - loss: 0.6904 - accuracy: 0.2157
31/1382 [..............................] - ETA: 2:30:34 - loss: 0.6902 - accuracy: 0.2173
32/1382 [..............................] - ETA: 2:29:26 - loss: 0.6901 - accuracy: 0.2185
33/1382 [..............................] - ETA: 2:28:55 - loss: 0.6900 - accuracy: 0.2199
34/1382 [..............................] - ETA: 2:28:05 - loss: 0.6899 - accuracy: 0.2213
35/1382 [..............................] - ETA: 2:27:23 - loss: 0.6898 - accuracy: 0.2227
36/1382 [..............................] - ETA: 2:27:02 - loss: 0.6897 - accuracy: 0.2238
37/1382 [..............................] - ETA: 2:26:56 - loss: 0.6895 - accuracy: 0.2253
38/1382 [..............................] - ETA: 2:26:32 - loss: 0.6893 - accuracy: 0.2266
39/1382 [..............................] - ETA: 2:26:11 - loss: 0.6891 - accuracy: 0.2278
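The training process does not continue even after waiting for several hours.
Another unusual thing I noticed is that with the batch size set to 1, the GPU is used continuously.
What could be the problem?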
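This may be a problem with the drive where the dataset is stored. The code worked fine everywhere else, but not on this server. I changed the drive (from one NFS share to another) and everything worked.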
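One way to check whether storage is the bottleneck is to time raw file reads from the dataset directory, without TensorFlow involved. The following is only a minimal sketch using the Python standard library; the function name `measure_read_throughput` and the `max_files` limit are made up for illustration and are not part of the original setup.

```python
import os
import time


def measure_read_throughput(root, max_files=200):
    """Read up to max_files files under root.

    Returns a tuple (files_read, bytes_read, elapsed_seconds).
    """
    files_read = 0
    bytes_read = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if files_read >= max_files:
                break
            # Read the whole file, as an image loader would.
            with open(os.path.join(dirpath, name), "rb") as f:
                bytes_read += len(f.read())
            files_read += 1
        if files_read >= max_files:
            break
    elapsed = time.perf_counter() - start
    return files_read, bytes_read, elapsed
```

Running this once against the directory on each NFS share and comparing the elapsed times should show whether one mount is much slower or hangs outright. Note that a second run on the same files may be served from the operating system's page cache, so the first run on each mount is the more telling one.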