caffe:"Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM" 训练期间
caffe: "Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM" during training
I'm programming a net in Caffe, and since I'm used to more comfortable "lazy" solutions, I'm a little overwhelmed by the problems that can occur.
Right now I'm getting this error:
Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM
This error is well known to be produced by mismatched CUDA or cuDNN versions, so I checked those, and they are up to date (CUDA: 8.0.61, cuDNN: 6.0.21).
Since I only get this error once I add the following ReLU layer, I assume it is caused by a parameter I mixed up:
layer {
name: "relu1"
type: "ReLU"
bottom: "pool1"
top: "relu1"
}
To give you all the information, here is the full log up to the error:
I0319 09:41:09.484148 6909 solver.cpp:44] Initializing solver from parameters:
test_iter: 10
test_interval: 1000
base_lr: 0.001
display: 20
max_iter: 800
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.04
stepsize: 200
snapshot: 10000
snapshot_prefix: "models/train"
solver_mode: GPU
net: "train_val.prototxt"
I0319 09:41:09.484392 6909 solver.cpp:87] Creating training net from net file: train_val.prototxt
I0319 09:41:09.485164 6909 net.cpp:294] The NetState phase (0) differed from the phase (1) specified by a rule in layer feed2
I0319 09:41:09.485183 6909 net.cpp:51] Initializing net from parameters:
name: "CaffeNet"
state {
phase: TRAIN
}
layer {
name: "feed"
type: "HDF5Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
hdf5_data_param {
source: "train_h5_list.txt"
batch_size: 50
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 1
kernel_size: 3
stride: 1
weight_filler {
type: "gaussian"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 1
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "pool1"
top: "relu1"
}
layer {
name: "conv2"
type: "Convolution"
bottom: "relu1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 1
kernel_size: 3
stride: 1
weight_filler {
type: "gaussian"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "conv2"
top: "ip2"
param {
lr_mult: 1
decay_mult: 1
}
inner_product_param {
num_output: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "sig1"
type: "Sigmoid"
bottom: "ip2"
top: "sig1"
}
layer {
name: "loss"
type: "EuclideanLoss"
bottom: "sig1"
bottom: "label"
top: "loss"
}
I0319 09:41:09.485752 6909 layer_factory.hpp:77] Creating layer feed
I0319 09:41:09.485780 6909 net.cpp:84] Creating Layer feed
I0319 09:41:09.485792 6909 net.cpp:380] feed -> data
I0319 09:41:09.485819 6909 net.cpp:380] feed -> label
I0319 09:41:09.485836 6909 hdf5_data_layer.cpp:80] Loading list of HDF5 filenames from: train_h5_list.txt
I0319 09:41:09.485860 6909 hdf5_data_layer.cpp:94] Number of HDF5 files: 1
I0319 09:41:09.486469 6909 hdf5.cpp:32] Datatype class: H5T_FLOAT
I0319 09:41:09.500986 6909 net.cpp:122] Setting up feed
I0319 09:41:09.501011 6909 net.cpp:129] Top shape: 50 227 227 3 (7729350)
I0319 09:41:09.501027 6909 net.cpp:129] Top shape: 50 1 (50)
I0319 09:41:09.501039 6909 net.cpp:137] Memory required for data: 30917600
I0319 09:41:09.501051 6909 layer_factory.hpp:77] Creating layer conv1
I0319 09:41:09.501080 6909 net.cpp:84] Creating Layer conv1
I0319 09:41:09.501087 6909 net.cpp:406] conv1 <- data
I0319 09:41:09.501101 6909 net.cpp:380] conv1 -> conv1
I0319 09:41:09.880740 6909 net.cpp:122] Setting up conv1
I0319 09:41:09.880765 6909 net.cpp:129] Top shape: 50 1 225 1 (11250)
I0319 09:41:09.880781 6909 net.cpp:137] Memory required for data: 30962600
I0319 09:41:09.880808 6909 layer_factory.hpp:77] Creating layer pool1
I0319 09:41:09.880836 6909 net.cpp:84] Creating Layer pool1
I0319 09:41:09.880846 6909 net.cpp:406] pool1 <- conv1
I0319 09:41:09.880861 6909 net.cpp:380] pool1 -> pool1
I0319 09:41:09.880888 6909 net.cpp:122] Setting up pool1
I0319 09:41:09.880899 6909 net.cpp:129] Top shape: 50 1 224 0 (0)
I0319 09:41:09.880913 6909 net.cpp:137] Memory required for data: 30962600
I0319 09:41:09.880921 6909 layer_factory.hpp:77] Creating layer relu1
I0319 09:41:09.880934 6909 net.cpp:84] Creating Layer relu1
I0319 09:41:09.880941 6909 net.cpp:406] relu1 <- pool1
I0319 09:41:09.880952 6909 net.cpp:380] relu1 -> relu1
F0319 09:41:09.881192 6909 cudnn.hpp:80] Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM
EDIT: I tried setting the solver mode to CPU, and I still get this error.
I found one of the problems.
I0319 09:41:09.880765 6909 net.cpp:129] Top shape: 50 1 225 1 (11250)
I0319 09:41:09.880781 6909 net.cpp:137] Memory required for data: 30962600
I0319 09:41:09.880808 6909 layer_factory.hpp:77] Creating layer pool1
I0319 09:41:09.880836 6909 net.cpp:84] Creating Layer pool1
I0319 09:41:09.880846 6909 net.cpp:406] pool1 <- conv1
I0319 09:41:09.880861 6909 net.cpp:380] pool1 -> pool1
I0319 09:41:09.880888 6909 net.cpp:122] Setting up pool1
I0319 09:41:09.880899 6909 net.cpp:129] Top shape: 50 1 224 0 (0)
As you can see, the first convolution layer receives input of shape (50 227 227 3), which is a problem because it takes the second dimension to be the channels.
Naturally, the convolution layer cuts the dimensions down accordingly, and every layer after it then gets the wrong input dimensions.
I managed to fix the problem by simply reshaping the input like this:
layer {
name: "reshape"
type: "Reshape"
bottom: "data"
top: "res"
reshape_param {
shape {
dim: 50
dim: 3
dim: 227
dim: 227
}
}
}
The first dimension here is the batch size, so whoever reads this must remember to set this dim to 1 in the .prototxt used at classification time (since that phase does not work with batches).
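Worth noting: Caffe's Reshape layer only reinterprets the blob's dimensions without moving any data in memory, so if the HDF5 file really stores the images in HWC order, transposing the arrays to NCHW before writing the file may be the cleaner fix. A minimal sketch, assuming h5py and the images held in a NumPy array of shape (N, H, W, C); the file and dataset names are hypothetical except for "data" and "label", which must match the tops of the HDF5Data layer:

import h5py
import numpy as np

# Hypothetical example data: N images in HWC order plus one float label each.
images = np.random.rand(50, 227, 227, 3).astype(np.float32)
labels = np.random.rand(50, 1).astype(np.float32)

# Transpose to Caffe's expected NCHW layout before writing,
# so no Reshape layer is needed in the net.
images_nchw = np.ascontiguousarray(images.transpose(0, 3, 1, 2))

with h5py.File("train.h5", "w") as f:
    f.create_dataset("data", data=images_nchw)   # shape (50, 3, 227, 227)
    f.create_dataset("label", data=labels)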
EDIT: I'll mark this as the answer, since it covers the basic solution to my problem and no other solution is in sight right now. If anyone wants to shed more light on the matter, please do.
The reason it throws this error is that the network has no more room to "shrink". From your error message, the shape 50 1 224 0 (0)
shows that the net's size has become 0 in one dimension.
To fix this error you can tweak several parameters: the (S)tride, the (K)ernel size, and the (P)adding. The width of the next layer (W_new) can be computed with the formula:
W_new = (W_old - K + 2*P)/S + 1
So if we have a 227x227x3 input and our first layer has K = 5, S = 2, P = 1 and numOutputs = N, then conv1 has the dimensions:
(227 - 5 + 2*1)/2 + 1 = 113, i.e. 113x113xN.
Note: when (W_old - K + 2*P) is not evenly divisible by S, Caffe rounds the result down for convolution layers (and up for pooling layers).
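To see how this produces the shapes in the log above, the formula can be traced through the misread input (a small sketch in Python; the kernel sizes are taken from the prototxt):

def out_size(w, k, s=1, p=0):
    # W_new = (W_old - K + 2*P)/S + 1, rounded down for convolution layers.
    return (w - k + 2 * p) // s + 1

# The input is misread as N=50, C=227, H=227, W=3, so the spatial dims are 227 x 3:
h, w = 227, 3
h, w = out_size(h, 3), out_size(w, 3)  # conv1, kernel 3, stride 1 -> 225 x 1
h, w = out_size(h, 2), out_size(w, 2)  # pool1, kernel 2, stride 1 -> 224 x 0
print(h, w)  # 224 0 -- the zero-width blob is what breaks the ReLU layer's cuDNN setup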
EDIT: The reason it shows up at the ReLU layer is probably that the ReLU layer is handed an empty blob to pass through, and therefore throws the error.