Not clear why Deeplearning4j with CUDA and cuDNN fails with OutOfMemory
Environment: Windows 7, GeForce GTX 750, CUDA 10.0, cuDNN 7.4
Maven dependencies:
<dependency>
<groupId>org.nd4j</groupId>
<artifactId>nd4j-cuda-10.0</artifactId>
<version>1.0.0-beta3</version>
</dependency>
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>deeplearning4j-cuda-10.0</artifactId>
<version>1.0.0-beta3</version>
</dependency>
I check test performance every 10 mini-batches. I used to call net.evaluate(), but that gave me this error:
Exception in thread "AMDSI prefetch thread" java.lang.RuntimeException: java.lang.RuntimeException: Failed to allocate 637074016 bytes from DEVICE [0] memory
at org.deeplearning4j.datasets.iterator.AsyncMultiDataSetIterator$AsyncPrefetchThread.run(AsyncMultiDataSetIterator.java:396)
Caused by: java.lang.RuntimeException: Failed to allocate 637074016 bytes from DEVICE [0] memory
at org.nd4j.jita.memory.CudaMemoryManager.allocate(CudaMemoryManager.java:76)
at org.nd4j.jita.workspace.CudaWorkspace.init(CudaWorkspace.java:88)
at org.nd4j.linalg.memory.abstracts.Nd4jWorkspace.initializeWorkspace(Nd4jWorkspace.java:508)
at org.nd4j.linalg.memory.abstracts.Nd4jWorkspace.close(Nd4jWorkspace.java:651)
at org.deeplearning4j.datasets.iterator.AsyncMultiDataSetIterator$AsyncPrefetchThread.run(AsyncMultiDataSetIterator.java:372)
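Then I switched from net.evaluate() to net.output() with training = false and reduced the size of the test set from 100 to 20.
This worked without errors. I then tried increasing the number of records to 30; it showed this warning but continued to work: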
2019-01-12 14:47:44 WARN org.deeplearning4j.nn.layers.BaseCudnnHelper Cannot allocate 300000000 bytes of device memory (CUDA error = 2), proceeding with host memory
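I can understand that the video card is short of memory (the GeForce GTX 750 spec lists 1G of memory), but
since it can fall back to host memory, I increased the test set size back to 100, and it now fails every time with this error: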
2019-01-12 14:59:29 WARN org.deeplearning4j.nn.layers.BaseCudnnHelper Cannot allocate 1000000000 bytes of device memory (CUDA error = 2), proceeding with host memory
Exception in thread "main" 2019-01-12 14:59:39 ERROR org.deeplearning4j.util.CrashReportingUtil >>> Out of Memory Exception Detected. Memory crash dump written to: C:\DATA\Projects\dl4j-language-model\dl4j-memory-crash-dump-1547294372940_1.txt
java.lang.OutOfMemoryError: Failed to allocate memory within limits: totalBytes (470M + 7629M) > maxBytes (7851M)
2019-01-12 14:59:39 WARN org.deeplearning4j.util.CrashReportingUtil Memory crash dump reporting can be disabled with CrashUtil.crashDumpsEnabled(false) or using system property -Dorg.deeplearning4j.crash.reporting.enabled=false
at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:580)
at org.deeplearning4j.nn.layers.BaseCudnnHelper$DataCache.<init>(BaseCudnnHelper.java:119)
2019-01-12 14:59:39 WARN org.deeplearning4j.util.CrashReportingUtil Memory crash dump reporting output location can be set with CrashUtil.crashDumpOutputDirectory(File) or using system property -Dorg.deeplearning4j.crash.reporting.directory=<path>
at org.deeplearning4j.nn.layers.recurrent.CudnnLSTMHelper.activate(CudnnLSTMHelper.java:509)
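The two BaseCudnnHelper warnings above suggest that the requested device buffer grows linearly with the number of test records: 300000000 bytes for 30 records and 1000000000 bytes for 100 records, which is 10000000 bytes per record in both cases. A small sketch of that arithmetic follows. The class and method names are hypothetical, and the extrapolation to other record counts is an assumption based only on these two data points, not on the DL4J source.

```java
public class CudnnCacheEstimate {
    // Bytes requested per test record, derived from one warning line.
    static long bytesPerRecord(long requestedBytes, int records) {
        return requestedBytes / records;
    }

    // Linear extrapolation (assumption): request size for another record count.
    static long estimateRequest(long perRecord, int records) {
        return perRecord * records;
    }

    public static void main(String[] args) {
        long from30 = bytesPerRecord(300_000_000L, 30);
        long from100 = bytesPerRecord(1_000_000_000L, 100);
        System.out.println("per record (30 records):  " + from30);
        System.out.println("per record (100 records): " + from100);
        // Hypothetical estimate for the 20-record run that produced no warning.
        System.out.println("estimate for 20 records:  " + estimateRequest(from30, 20));
    }
}
```

If the linear relationship holds, the request size can be estimated before choosing a test batch size and compared against the free GPU memory reported by the driver.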
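Now, I assume that maxBytes (7851M) refers to the heap size (the JVM runs with -Xmx8G -Xms8G), but I also log Runtime freeMemory() and totalMemory(), and right before the crash they showed the following, which is plenty of free memory: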
2019-01-12 15:29:20 INFO Free memory: 7722607976/8232370176
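Figures from Runtime describe only the Java heap. Below is a minimal sketch of how a line like the one above can be produced with the standard Runtime API, next to a direct (off-heap) buffer allocation that illustrates memory living outside those figures. The class name HeapReport and the 64 MB buffer size are illustrative assumptions, not the author's code.

```java
import java.nio.ByteBuffer;

public class HeapReport {
    // Formats heap figures in the same free/total layout as the log line above.
    static String format(long freeBytes, long totalBytes) {
        return "Free memory: " + freeBytes + "/" + totalBytes;
    }

    // Allocates native (off-heap) memory; only the small ByteBuffer object lives on the heap.
    static ByteBuffer allocateOffHeap(int bytes) {
        return ByteBuffer.allocateDirect(bytes);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println(format(rt.freeMemory(), rt.totalMemory()));

        ByteBuffer offHeap = allocateOffHeap(64 * 1024 * 1024);
        System.out.println("Direct buffer: " + offHeap.isDirect()
                + ", capacity " + offHeap.capacity());

        // The buffer's backing memory is native, so it is not counted in these heap figures.
        System.out.println(format(rt.freeMemory(), rt.totalMemory()));
    }
}
```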
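So my question is: where does the totalBytes (470M + 7629M) figure come from, and why can't the required 1G be allocated if there is free memory inside the JVM?
Below is the memory crash report: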
Deeplearning4j OOM Exception Encountered for ComputationGraph
Timestamp: 2019-01-12 14:59:32.940
Thread ID 1
Thread Name main
Stack Trace:
java.lang.OutOfMemoryError: Failed to allocate memory within limits: totalBytes (470M + 7629M) > maxBytes (7851M)
at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:580)
at org.deeplearning4j.nn.layers.BaseCudnnHelper$DataCache.<init>(BaseCudnnHelper.java:119)
at org.deeplearning4j.nn.layers.recurrent.CudnnLSTMHelper.activate(CudnnLSTMHelper.java:509)
at org.deeplearning4j.nn.layers.recurrent.LSTMHelpers.activateHelper(LSTMHelpers.java:205)
at org.deeplearning4j.nn.layers.recurrent.LSTM.activateHelper(LSTM.java:163)
at org.deeplearning4j.nn.layers.recurrent.LSTM.activate(LSTM.java:140)
at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doForward(LayerVertex.java:110)
at org.deeplearning4j.nn.graph.ComputationGraph.outputOfLayersDetached(ComputationGraph.java:2316)
at org.deeplearning4j.nn.graph.ComputationGraph.output(ComputationGraph.java:1727)
at org.deeplearning4j.nn.graph.ComputationGraph.output(ComputationGraph.java:1686)
at org.deeplearning4j.nn.graph.ComputationGraph.output(ComputationGraph.java:1672)
at org.lungen.deeplearning.net.predictor.CharacterSequenceValuePredictorNet.testOutputAndScore(CharacterSequenceValuePredictorNet.java:195)
at org.lungen.deeplearning.net.predictor.CharacterSequenceValuePredictorNet.train(CharacterSequenceValuePredictorNet.java:166)
at org.lungen.deeplearning.net.predictor.CharacterSequenceValuePredictorNet.main(CharacterSequenceValuePredictorNet.java:283)
========== Memory Information ==========
----- Version Information -----
Deeplearning4j Version 1.0.0-beta3
Deeplearning4j CUDA deeplearning4j-cuda-10.0
----- System Information -----
Operating System Microsoft Windows 7 SP1
CPU Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
CPU Cores - Physical 4
CPU Cores - Logical 8
Total System Memory 15.97 GB (17144102912)
Number of GPUs Detected 1
Name CC Total Memory Used Memory Free Memory
GeForce GTX 750 5.0 2 GB (2147483648) 1.67 GB (1795002368) 336.15 MB (352481280)
----- ND4J Environment Information -----
Data Type FLOAT
backend CUDA
blas.vendor CUBLAS
os Windows 7
----- Memory Configuration -----
JVM Memory: XMX 7.67 GB (8232370176)
JVM Memory: current 7.67 GB (8232370176)
JavaCPP Memory: Max Bytes 7.67 GB (8232370176)
JavaCPP Memory: Max Physical 15.33 GB (16464740352)
JavaCPP Memory: Current Bytes 470.26 MB (493106209)
JavaCPP Memory: Current Physical 3.35 GB (3601498112)
Periodic GC Enabled true
Periodic GC Frequency 100 ms
----- Workspace Information -----
Workspaces: # for current thread 4
Current thread workspaces:
Name State Size # Cycles
WS_LAYER_WORKING_MEM CLOSED 117.40 MB (123100000) 6802
WS_ALL_LAYERS_ACT CLOSED 19.41 MB (20349840) 2400
WS_LAYER_ACT_0 CLOSED 6.23 MB (6528000) 1601
WS_LAYER_ACT_1 CLOSED 381.47 MB (400000000) 1601
Workspaces total size 524.50 MB (549977840)
Helper Workspaces
CUDNN_WORKSPACE 7.06 MB (7408000)
----- Network Information -----
Network # Parameters 1432106
Parameter Memory 5.46 MB (5728424)
Parameter Gradients Memory 5.46 MB (5728424)
Updater Number of Elements 2862812
Updater Memory 10.92 MB (11451248)
Updater Classes:
org.nd4j.linalg.learning.AdamUpdater
org.nd4j.linalg.learning.NoOpUpdater
Params + Gradient + Updater Memory 16.38 MB (17179672)
Iteration Count 400
Epoch Count 0
Backprop Type TruncatedBPTT
TBPTT Length 50/50
Workspace Mode: Training ENABLED
Workspace Mode: Inference ENABLED
Number of Layers 7
Layer Counts
BatchNormalization 2
DenseLayer 1
LSTM 3
OutputLayer 1
Layer Parameter Breakdown
Idx Name Layer Type Layer # Parameters Layer Parameter Memory
1 lstm-1 LSTM 403000 1.54 MB (1612000)
2 lstm-2 LSTM 501000 1.91 MB (2004000)
3 lstm-3 LSTM 501000 1.91 MB (2004000)
5 norm-1 BatchNormalization 1000 3.91 KB (4000)
6 dense-1 DenseLayer 25100 98.05 KB (100400)
7 norm-2 BatchNormalization 400 1.56 KB (1600)
8 output OutputLayer 606 2.37 KB (2424)
----- Layer Helpers - Memory Use -----
# Layer Name Layer Class Helper Class Total Memory Memory Breakdown
5 norm-1 BatchNormalization CudnnBatchNormalizationHelper 1.95 KB (2000) {meanCache=1000, varCache=1000}
7 norm-2 BatchNormalization CudnnBatchNormalizationHelper 800 B {meanCache=400, varCache=400}
Total Helper Count 2
Helper Count w/ Memory 2
Total Helper Persistent Memory Use 2.73 KB (2800)
----- Network Activations: Inferred Activation Shapes -----
Current Minibatch Size 100
Current Input Shape (Input 0) [100, 152, 2000]
Idx Name Layer Type Activations Type Activations Shape # Elements Memory
0 recurrentInput InputVertex InputTypeRecurrent(152,timeSeriesLength=2000) [100, 152, 2000] 30400000 115.97 MB (121600000)
1 lstm-1 LSTM InputTypeRecurrent(250,timeSeriesLength=2000) [100, 250, 2000] 50000000 190.73 MB (200000000)
2 lstm-2 LSTM InputTypeRecurrent(250,timeSeriesLength=2000) [100, 250, 2000] 50000000 190.73 MB (200000000)
3 lstm-3 LSTM InputTypeRecurrent(250,timeSeriesLength=2000) [100, 250, 2000] 50000000 190.73 MB (200000000)
4 thoughtVector LastTimeStepVertex InputTypeFeedForward(250) [100, 250] 25000 97.66 KB (100000)
5 norm-1 BatchNormalization InputTypeFeedForward(250) [100, 250] 25000 97.66 KB (100000)
6 dense-1 DenseLayer InputTypeFeedForward(100) [100, 100] 10000 39.06 KB (40000)
7 norm-2 BatchNormalization InputTypeFeedForward(100) [100, 100] 10000 39.06 KB (40000)
8 output OutputLayer InputTypeFeedForward(6) [100, 6] 600 2.34 KB (2400)
Total Activations Memory 688.44 MB (721882400)
Total Activation Gradient Memory 688.44 MB (721880000)
----- Network Training Listeners -----
Number of Listeners 3
Listener 0 org.x.deeplearning.listener.ScorePrintListener@7b78ed6a
Listener 1 ScoreIterationListener(10)
Listener 2 org.x.deeplearning.listener.UIStatsListener@6fca5907
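So, a short explanation to close this question.
ND4J uses off-heap memory, which is basically mapped to GPU memory. So, as @Samuel Audet pointed out, 7629M refers to off-heap memory, which obviously cannot fit into the GPU memory of my GTX 750.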
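To make the numbers in the error message concrete: 470M matches "JavaCPP Memory: Current Bytes" in the crash dump, 7851M matches "JavaCPP Memory: Max Bytes" (equal to the JVM's XMX value here), and 7629M is the size of the rejected request. The sketch below is a simplified model of such a limit check, not JavaCPP's actual code. The class and method names are hypothetical, and the 8,000,000,000-byte request is an assumption, chosen because it is one value that formats as 7629M.

```java
public class OffHeapLimitCheck {
    // Formats a byte count in whole mebibytes, e.g. 8232370176 -> "7851M".
    static String formatMb(long bytes) {
        return (bytes / (1024L * 1024L)) + "M";
    }

    // Simplified model of a tracked off-heap budget: reject a request that would
    // push the tracked total above the configured maximum.
    static void checkAllocation(long totalBytes, long requestedBytes, long maxBytes) {
        if (maxBytes > 0 && totalBytes + requestedBytes > maxBytes) {
            throw new OutOfMemoryError("Failed to allocate memory within limits: totalBytes ("
                    + formatMb(totalBytes) + " + " + formatMb(requestedBytes)
                    + ") > maxBytes (" + formatMb(maxBytes) + ")");
        }
    }

    public static void main(String[] args) {
        long totalBytes = 493_106_209L;       // "JavaCPP Memory: Current Bytes" in the crash dump
        long maxBytes = 8_232_370_176L;       // "JavaCPP Memory: Max Bytes" in the crash dump
        long requestedBytes = 8_000_000_000L; // assumption: one value that formats as 7629M
        try {
            checkAllocation(totalBytes, requestedBytes, maxBytes);
        } catch (OutOfMemoryError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this model the free heap reported by Runtime is irrelevant: the check compares tracked off-heap bytes plus the new request against the off-heap limit, so the allocation is refused even though the heap is almost empty.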
A final note from the DL4J docs:
Note that if your GPU has < 2g of RAM, it’s probably not usable for deep learning.
You should consider using your CPU if this is the case. Typical deep-learning workloads should have 4GB of RAM at minimum. Even that is small. 8GB of RAM on a GPU is recommended for deep learning workloads.