Aborted Caffe training - no error message
I want to train my network with Caffe, but unfortunately the process aborts without any specific error message whenever I try to run train.sh.
I have already generated my pretrained weights and my model.prototxt, and I have verified that my LMDB database is fine. Here is my console output (only the interesting part, due to the character limit):
I0504 06:37:33.873118 50237 caffe.cpp:210] Use CPU.
I0504 06:37:33.874349 50237 solver.cpp:63] Initializing solver from parameters:
train_net: "example/MobileNetSSD_train.prototxt"
test_net: "example/MobileNetSSD_test.prototxt"
test_iter: 673
test_interval: 10000
base_lr: 0.0005
display: 10
max_iter: 120000
lr_policy: "multistep"
gamma: 0.5
weight_decay: 5e-05
snapshot: 1000
snapshot_prefix: "snapshot/mobilenet"
solver_mode: CPU
debug_info: false
train_state {
level: 0
stage: ""
}
snapshot_after_train: true
test_initialization: false
average_loss: 10
stepvalue: 20000
stepvalue: 40000
iter_size: 1
type: "RMSProp"
eval_type: "detection"
ap_version: "11point"
I0504 06:37:33.875725 50237 solver.cpp:96] Creating training net from train_net file: example/MobileNetSSD_train.prototxt
I0504 06:37:33.876616 50237 upgrade_proto.cpp:77] Attempting to upgrade batch norm layers using deprecated params: example/MobileNetSSD_train.prototxt
I0504 06:37:33.876662 50237 upgrade_proto.cpp:80] Successfully upgraded batch norm layers using deprecated params.
I0504 06:37:33.876909 50237 net.cpp:58] Initializing net from parameters:
name: "MobileNet-SSD"
state {
phase: TRAIN
level: 0
stage: ""
}
layer {
name: "data"
type: "AnnotatedData"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
scale: 0.007843
mirror: true
mean_value: 127.5
mean_value: 127.5
mean_value: 127.5
resize_param {
prob: 1
resize_mode: WARP
height: 300
width: 300
interp_mode: LINEAR
interp_mode: AREA
interp_mode: NEAREST
interp_mode: CUBIC
interp_mode: LANCZOS4
}
emit_constraint {
emit_type: CENTER
}
distort_param {
brightness_prob: 0.5
brightness_delta: 32
contrast_prob: 0.5
contrast_lower: 0.5
contrast_upper: 1.5
hue_prob: 0.5
hue_delta: 18
saturation_prob: 0.5
saturation_lower: 0.5
saturation_upper: 1.5
random_order_prob: 0
}
expand_param {
prob: 0.5
max_expand_ratio: 4
}
}
data_param {
source: "trainval_lmdb/"
batch_size: 24
backend: LMDB
}
annotated_data_param {
batch_sampler {
max_sample: 1
max_trials: 1
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.1
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.3
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.5
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.7
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.9
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
max_jaccard_overlap: 1
}
max_sample: 1
max_trials: 50
}
label_map_file: "labelmap.prototxt"
}
}
layer {
name: "conv0"
type: "Convolution"
bottom: "data"
top: "conv0"
param {
lr_mult: 0.1
decay_mult: 0.1
}
convolution_param {
num_output: 32
bias_term: false
pad: 1
kernel_size: 3
stride: 2
weight_filler {
type: "msra"
}
}
}
layer {
name: "conv0/bn"
type: "BatchNorm"
bottom: "conv0"
top: "conv0"
}
layer {
name: "conv0/scale"
type: "Scale"
bottom: "conv0"
top: "conv0"
param {
lr_mult: 0.1
decay_mult: 0
}
param {
lr_mult: 0.2
decay_mult: 0
}
scale_param {
filler {
value: 1
}
bias_term: true
bias_filler {
value: 0
}
}
}
layer {
name: "conv0/relu"
type: "ReLU"
bottom: "conv0"
top: "conv0"
}
layer {
name: "conv1/dw"
type: "Convolution"
bottom: "conv0"
top: "conv1/dw"
param {
lr_mult: 0.1
decay_mult: 0.1
}
convolution_param {
num_output: 32
bias_term: false
pad: 1
kernel_size: 3
group: 32
weight_filler {
type: "msra"
}
engine: CAFFE
}
}
layer {
name: "conv1/dw/bn"
type: "BatchNorm"
bottom: "conv1/dw"
top: "conv1/dw"
}
layer {
name: "conv1/dw/scale"
type: "Scale"
bottom: "conv1/dw"
top: "conv1/dw"
param {
lr_mult: 0.1
decay_mult: 0
}
param {
lr_mult: 0.2
decay_mult: 0
}
scale_param {
filler {
value: 1
}
bias_term: true
bias_filler {
value: 0
}
}
}
[...]
layer {
name: "conv17_2/relu"
type: "ReLU"
bottom: "conv17_2"
top: "conv17_2"
}
layer {
name: "conv11_mbox_loc"
type: "Convolution"
bottom: "conv11"
top: "conv11_mbox_loc"
param {
lr_mult: 0.1
decay_mult: 0.1
}
param {
lr_mult: 0.2
decay_mult: 0
}
convolution_param {
num_output: 12
kernel_size: 1
weight_filler {
type: "msra"
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "conv11_mbox_loc_perm"
type: "Permute"
bottom: "conv11_mbox_loc"
top: "conv11_mbox_loc_perm"
permute_param {
order: 0
[...]
I0504 06:37:33.890111 50237 layer_factory.hpp:77] Creating layer data
I0504 06:37:33.890482 50237 net.cpp:100] Creating Layer data
I0504 06:37:33.890534 50237 net.cpp:408] data -> data
I0504 06:37:33.890727 50239 db_lmdb.cpp:35] Opened lmdb trainval_lmdb/
I0504 06:37:33.891376 50237 net.cpp:408] data -> label
I0504 06:37:33.895253 50237 annotated_data_layer.cpp:62] output data size: 24,3,300,300
I0504 06:37:33.895355 50237 net.cpp:150] Setting up data
I0504 06:37:33.895393 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.895494 50237 net.cpp:157] Top shape: 1 1 1 8 (8)
I0504 06:37:33.895525 50237 net.cpp:165] Memory required for data: 25920032
I0504 06:37:33.895558 50237 layer_factory.hpp:77] Creating layer data_data_0_split
I0504 06:37:33.895594 50237 net.cpp:100] Creating Layer data_data_0_split
I0504 06:37:33.895627 50237 net.cpp:434] data_data_0_split <- data
I0504 06:37:33.895660 50237 net.cpp:408] data_data_0_split -> data_data_0_split_0
I0504 06:37:33.895694 50237 net.cpp:408] data_data_0_split -> data_data_0_split_1
I0504 06:37:33.895726 50237 net.cpp:408] data_data_0_split -> data_data_0_split_2
I0504 06:37:33.895757 50237 net.cpp:408] data_data_0_split -> data_data_0_split_3
I0504 06:37:33.895817 50237 net.cpp:408] data_data_0_split -> data_data_0_split_4
I0504 06:37:33.895853 50237 net.cpp:408] data_data_0_split -> data_data_0_split_5
I0504 06:37:33.895884 50237 net.cpp:408] data_data_0_split -> data_data_0_split_6
I0504 06:37:33.895965 50237 net.cpp:150] Setting up data_data_0_split
I0504 06:37:33.896008 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896039 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896068 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896113 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896143 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896173 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896201 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896230 50237 net.cpp:165] Memory required for data: 207360032
I0504 06:37:33.896277 50237 layer_factory.hpp:77] Creating layer conv0
I0504 06:37:33.896404 50237 net.cpp:100] Creating Layer conv0
I0504 06:37:33.896438 50237 net.cpp:434] conv0 <- data_data_0_split_0
I0504 06:37:33.896469 50237 net.cpp:408] conv0 -> conv0
I0504 06:37:33.897195 50237 net.cpp:150] Setting up conv0
I0504 06:37:33.897239 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.897289 50237 net.cpp:165] Memory required for data: 276480032
I0504 06:37:33.897328 50237 layer_factory.hpp:77] Creating layer conv0/bn
I0504 06:37:33.897364 50237 net.cpp:100] Creating Layer conv0/bn
I0504 06:37:33.897394 50237 net.cpp:434] conv0/bn <- conv0
I0504 06:37:33.897423 50237 net.cpp:395] conv0/bn -> conv0 (in-place)
I0504 06:37:33.897517 50237 net.cpp:150] Setting up conv0/bn
I0504 06:37:33.897550 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.897580 50237 net.cpp:165] Memory required for data: 345600032
I0504 06:37:33.897611 50237 layer_factory.hpp:77] Creating layer conv0/scale
I0504 06:37:33.897644 50237 net.cpp:100] Creating Layer conv0/scale
I0504 06:37:33.897672 50237 net.cpp:434] conv0/scale <- conv0
I0504 06:37:33.897701 50237 net.cpp:395] conv0/scale -> conv0 (in-place)
I0504 06:37:33.898386 50237 layer_factory.hpp:77] Creating layer conv0/scale
I0504 06:37:33.898525 50237 net.cpp:150] Setting up conv0/scale
I0504 06:37:33.898561 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.898591 50237 net.cpp:165] Memory required for data: 414720032
I0504 06:37:33.898622 50237 layer_factory.hpp:77] Creating layer conv0/relu
I0504 06:37:33.898654 50237 net.cpp:100] Creating Layer conv0/relu
I0504 06:37:33.898684 50237 net.cpp:434] conv0/relu <- conv0
I0504 06:37:33.898712 50237 net.cpp:395] conv0/relu -> conv0 (in-place)
I0504 06:37:33.898746 50237 net.cpp:150] Setting up conv0/relu
I0504 06:37:33.898777 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.898805 50237 net.cpp:165] Memory required for data: 483840032
I0504 06:37:33.898833 50237 layer_factory.hpp:77] Creating layer conv1/dw
I0504 06:37:33.898864 50237 net.cpp:100] Creating Layer conv1/dw
I0504 06:37:33.898893 50237 net.cpp:434] conv1/dw <- conv0
I0504 06:37:33.898922 50237 net.cpp:408] conv1/dw -> conv1/dw
I0504 06:37:33.898962 50237 net.cpp:150] Setting up conv1/dw
I0504 06:37:33.898993 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.899021 50237 net.cpp:165] Memory required for data: 552960032
I0504 06:37:33.899050 50237 layer_factory.hpp:77] Creating layer conv1/dw/bn
[...]
I0504 06:37:33.985625 50237 layer_factory.hpp:77] Creating layer conv13/dw/scale
I0504 06:37:33.985718 50237 net.cpp:100] Creating Layer conv13/dw/scale
@ 0x7f192267c2c0 caffe::GenerateBatchSamples()
I0504 06:37:33.987087 50237 net.cpp:434] conv13/dw/scale <- conv13/dw
I0504 06:37:33.987202 50237 net.cpp:395] conv13/dw/scale -> conv13/dw (in-place)
I0504 06:37:33.987262 50237 layer_factory.hpp:77] Creating layer conv13/dw/scale
I0504 06:37:33.987337 50237 net.cpp:150] Setting up conv13/dw/scale
I0504 06:37:33.987366 50237 net.cpp:157] Top shape: 24 1024 10 10 (2457600)
I0504 06:37:33.987393 50237 net.cpp:165] Memory required for data: 3753455648
I0504 06:37:33.987419 50237 layer_factory.hpp:77] Creating layer conv13/dw/relu
I0504 06:37:33.987447 50237 net.cpp:100] Creating Layer conv13/dw/relu
I0504 06:37:33.987470 50237 net.cpp:434] conv13/dw/relu <- conv13/dw
I0504 06:37:33.987504 50237 net.cpp:395] conv13/dw/relu -> conv13/dw (in-place)
I0504 06:37:33.987534 50237 net.cpp:150] Setting up conv13/dw/relu
I0504 06:37:33.987557 50237 net.cpp:157] Top shape: 24 1024 10 10 (2457600)
I0504 06:37:33.987582 50237 net.cpp:165] Memory required for data: 3763286048
I0504 06:37:33.987607 50237 layer_factory.hpp:77] Creating layer conv13
I0504 06:37:33.987639 50237 net.cpp:100] Creating Layer conv13
I0504 06:37:33.987665 50237 net.cpp:434] conv13 <- conv13/dw
I0504 06:37:33.987691 50237 net.cpp:408] conv13 -> conv13
@ 0x7f19226dc732 caffe::AnnotatedDataLayer<>::load_batch()
@ 0x7f19226e000a caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
@ 0x7f191ec9fbcd (unknown)
@ 0x7f191c4326db start_thread
@ 0x7f19210eb88f clone
Aborted (core dumped)
I suspect this could be a memory problem, since it fails while building the conv layers (I am training on CPU), but my batch size is already only 24. Does anyone know what exactly causes this problem and how to fix it?
Thanks!
After spending a lot of time on this problem and trying countless solutions, I finally found what causes it. This bug is especially nasty because in most cases it simply produces no error message at all.
See the original thread here: https://github.com/weiliu89/caffe/issues/669#issuecomment-339542120
You have to edit the source code slightly before compiling. Go to caffe/src/caffe/util/math_functions.cpp, find this function around line 247, and edit it so that it looks like this:
template <typename Dtype>
void caffe_rng_uniform(const int n, Dtype a, Dtype b, Dtype* r) {
  CHECK_GE(n, 0);
  CHECK(r);
  if (a > b) {  // swap the bounds if they arrive inverted
    Dtype c = a;
    a = b;
    b = c;
  }
  CHECK_LE(a, b);
  boost::uniform_real<Dtype> random_distribution(a, caffe_nextafter<Dtype>(b));
  boost::variate_generator<caffe::rng_t*, boost::uniform_real<Dtype> >
      variate_generator(caffe_rng(), random_distribution);
  for (int i = 0; i < n; ++i) {
    r[i] = variate_generator();
  }
}
Note that I just added an if statement (which swaps the variables a and b when a is greater than b) and removed the const qualifier from the Dtype a and Dtype b parameters.
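To see why inverted bounds kill the process with no useful message: CHECK_LE(a, b) (and boost's own assertion inside uniform_real) simply aborts when the batch sampler hands in a range with a > b. As an illustration only, here is the same guard sketched in plain C++, with the standard library standing in for boost and all names my own, not Caffe's:

```cpp
#include <algorithm>
#include <cassert>
#include <random>

// Illustration only (not Caffe code): std::uniform_real_distribution
// stands in for boost::uniform_real. The essential part is the guard
// that swaps the bounds when a > b, so the distribution never sees an
// inverted (and therefore invalid) range.
template <typename Dtype>
void rng_uniform_safe(const int n, Dtype a, Dtype b, Dtype* r) {
  assert(n >= 0);
  assert(r != nullptr);
  if (a > b) std::swap(a, b);  // the fix: tolerate inverted bounds
  std::mt19937 gen(42);        // fixed seed, for reproducibility
  std::uniform_real_distribution<Dtype> dist(a, b);
  for (int i = 0; i < n; ++i) {
    r[i] = dist(gen);
  }
}
```

With the guard in place, a call like rng_uniform_safe(5, 0.9f, 0.3f, out) fills out with values in [0.3, 0.9] instead of tripping an assertion.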
Then simply run:
make clean
make -j$(nproc)
make py -j$(nproc)
make test -j$(nproc)
make runtest -j$(nproc) # You should run the tests after compiling to make sure you don't run into any other unexpected error.
For me, this did the trick!