Aborted Caffe training - no error message
I want to train my network with Caffe, but unfortunately the process aborts without any specific error message whenever I try to run train.sh.
I have already generated my pretrained weights and my model.prototxt, and I have verified that my LMDB database is fine. Here is my console output (only the interesting part, due to the character limit):
I0504 06:37:33.873118 50237 caffe.cpp:210] Use CPU.
I0504 06:37:33.874349 50237 solver.cpp:63] Initializing solver from parameters:
train_net: "example/MobileNetSSD_train.prototxt"
test_net: "example/MobileNetSSD_test.prototxt"
test_iter: 673
test_interval: 10000
base_lr: 0.0005
display: 10
max_iter: 120000
lr_policy: "multistep"
gamma: 0.5
weight_decay: 5e-05
snapshot: 1000
snapshot_prefix: "snapshot/mobilenet"
solver_mode: CPU
debug_info: false
train_state {
level: 0
stage: ""
}
snapshot_after_train: true
test_initialization: false
average_loss: 10
stepvalue: 20000
stepvalue: 40000
iter_size: 1
type: "RMSProp"
eval_type: "detection"
ap_version: "11point"
I0504 06:37:33.875725 50237 solver.cpp:96] Creating training net from train_net file: example/MobileNetSSD_train.prototxt
I0504 06:37:33.876616 50237 upgrade_proto.cpp:77] Attempting to upgrade batch norm layers using deprecated params: example/MobileNetSSD_train.prototxt
I0504 06:37:33.876662 50237 upgrade_proto.cpp:80] Successfully upgraded batch norm layers using deprecated params.
I0504 06:37:33.876909 50237 net.cpp:58] Initializing net from parameters:
name: "MobileNet-SSD"
state {
phase: TRAIN
level: 0
stage: ""
}
layer {
name: "data"
type: "AnnotatedData"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
scale: 0.007843
mirror: true
mean_value: 127.5
mean_value: 127.5
mean_value: 127.5
resize_param {
prob: 1
resize_mode: WARP
height: 300
width: 300
interp_mode: LINEAR
interp_mode: AREA
interp_mode: NEAREST
interp_mode: CUBIC
interp_mode: LANCZOS4
}
emit_constraint {
emit_type: CENTER
}
distort_param {
brightness_prob: 0.5
brightness_delta: 32
contrast_prob: 0.5
contrast_lower: 0.5
contrast_upper: 1.5
hue_prob: 0.5
hue_delta: 18
saturation_prob: 0.5
saturation_lower: 0.5
saturation_upper: 1.5
random_order_prob: 0
}
expand_param {
prob: 0.5
max_expand_ratio: 4
}
}
data_param {
source: "trainval_lmdb/"
batch_size: 24
backend: LMDB
}
annotated_data_param {
batch_sampler {
max_sample: 1
max_trials: 1
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.1
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.3
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.5
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.7
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.9
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
max_jaccard_overlap: 1
}
max_sample: 1
max_trials: 50
}
label_map_file: "labelmap.prototxt"
}
}
layer {
name: "conv0"
type: "Convolution"
bottom: "data"
top: "conv0"
param {
lr_mult: 0.1
decay_mult: 0.1
}
convolution_param {
num_output: 32
bias_term: false
pad: 1
kernel_size: 3
stride: 2
weight_filler {
type: "msra"
}
}
}
layer {
name: "conv0/bn"
type: "BatchNorm"
bottom: "conv0"
top: "conv0"
}
layer {
name: "conv0/scale"
type: "Scale"
bottom: "conv0"
top: "conv0"
param {
lr_mult: 0.1
decay_mult: 0
}
param {
lr_mult: 0.2
decay_mult: 0
}
scale_param {
filler {
value: 1
}
bias_term: true
bias_filler {
value: 0
}
}
}
layer {
name: "conv0/relu"
type: "ReLU"
bottom: "conv0"
top: "conv0"
}
layer {
name: "conv1/dw"
type: "Convolution"
bottom: "conv0"
top: "conv1/dw"
param {
lr_mult: 0.1
decay_mult: 0.1
}
convolution_param {
num_output: 32
bias_term: false
pad: 1
kernel_size: 3
group: 32
weight_filler {
type: "msra"
}
engine: CAFFE
}
}
layer {
name: "conv1/dw/bn"
type: "BatchNorm"
bottom: "conv1/dw"
top: "conv1/dw"
}
layer {
name: "conv1/dw/scale"
type: "Scale"
bottom: "conv1/dw"
top: "conv1/dw"
param {
lr_mult: 0.1
decay_mult: 0
}
param {
lr_mult: 0.2
decay_mult: 0
}
scale_param {
filler {
value: 1
}
bias_term: true
bias_filler {
value: 0
}
}
}
[...]
layer {
name: "conv17_2/relu"
type: "ReLU"
bottom: "conv17_2"
top: "conv17_2"
}
layer {
name: "conv11_mbox_loc"
type: "Convolution"
bottom: "conv11"
top: "conv11_mbox_loc"
param {
lr_mult: 0.1
decay_mult: 0.1
}
param {
lr_mult: 0.2
decay_mult: 0
}
convolution_param {
num_output: 12
kernel_size: 1
weight_filler {
type: "msra"
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "conv11_mbox_loc_perm"
type: "Permute"
bottom: "conv11_mbox_loc"
top: "conv11_mbox_loc_perm"
permute_param {
order: 0
[...]
I0504 06:37:33.890111 50237 layer_factory.hpp:77] Creating layer data
I0504 06:37:33.890482 50237 net.cpp:100] Creating Layer data
I0504 06:37:33.890534 50237 net.cpp:408] data -> data
I0504 06:37:33.890727 50239 db_lmdb.cpp:35] Opened lmdb trainval_lmdb/
I0504 06:37:33.891376 50237 net.cpp:408] data -> label
I0504 06:37:33.895253 50237 annotated_data_layer.cpp:62] output data size: 24,3,300,300
I0504 06:37:33.895355 50237 net.cpp:150] Setting up data
I0504 06:37:33.895393 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.895494 50237 net.cpp:157] Top shape: 1 1 1 8 (8)
I0504 06:37:33.895525 50237 net.cpp:165] Memory required for data: 25920032
I0504 06:37:33.895558 50237 layer_factory.hpp:77] Creating layer data_data_0_split
I0504 06:37:33.895594 50237 net.cpp:100] Creating Layer data_data_0_split
I0504 06:37:33.895627 50237 net.cpp:434] data_data_0_split <- data
I0504 06:37:33.895660 50237 net.cpp:408] data_data_0_split -> data_data_0_split_0
I0504 06:37:33.895694 50237 net.cpp:408] data_data_0_split -> data_data_0_split_1
I0504 06:37:33.895726 50237 net.cpp:408] data_data_0_split -> data_data_0_split_2
I0504 06:37:33.895757 50237 net.cpp:408] data_data_0_split -> data_data_0_split_3
I0504 06:37:33.895817 50237 net.cpp:408] data_data_0_split -> data_data_0_split_4
I0504 06:37:33.895853 50237 net.cpp:408] data_data_0_split -> data_data_0_split_5
I0504 06:37:33.895884 50237 net.cpp:408] data_data_0_split -> data_data_0_split_6
I0504 06:37:33.895965 50237 net.cpp:150] Setting up data_data_0_split
I0504 06:37:33.896008 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896039 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896068 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896113 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896143 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896173 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896201 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896230 50237 net.cpp:165] Memory required for data: 207360032
I0504 06:37:33.896277 50237 layer_factory.hpp:77] Creating layer conv0
I0504 06:37:33.896404 50237 net.cpp:100] Creating Layer conv0
I0504 06:37:33.896438 50237 net.cpp:434] conv0 <- data_data_0_split_0
I0504 06:37:33.896469 50237 net.cpp:408] conv0 -> conv0
I0504 06:37:33.897195 50237 net.cpp:150] Setting up conv0
I0504 06:37:33.897239 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.897289 50237 net.cpp:165] Memory required for data: 276480032
I0504 06:37:33.897328 50237 layer_factory.hpp:77] Creating layer conv0/bn
I0504 06:37:33.897364 50237 net.cpp:100] Creating Layer conv0/bn
I0504 06:37:33.897394 50237 net.cpp:434] conv0/bn <- conv0
I0504 06:37:33.897423 50237 net.cpp:395] conv0/bn -> conv0 (in-place)
I0504 06:37:33.897517 50237 net.cpp:150] Setting up conv0/bn
I0504 06:37:33.897550 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.897580 50237 net.cpp:165] Memory required for data: 345600032
I0504 06:37:33.897611 50237 layer_factory.hpp:77] Creating layer conv0/scale
I0504 06:37:33.897644 50237 net.cpp:100] Creating Layer conv0/scale
I0504 06:37:33.897672 50237 net.cpp:434] conv0/scale <- conv0
I0504 06:37:33.897701 50237 net.cpp:395] conv0/scale -> conv0 (in-place)
I0504 06:37:33.898386 50237 layer_factory.hpp:77] Creating layer conv0/scale
I0504 06:37:33.898525 50237 net.cpp:150] Setting up conv0/scale
I0504 06:37:33.898561 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.898591 50237 net.cpp:165] Memory required for data: 414720032
I0504 06:37:33.898622 50237 layer_factory.hpp:77] Creating layer conv0/relu
I0504 06:37:33.898654 50237 net.cpp:100] Creating Layer conv0/relu
I0504 06:37:33.898684 50237 net.cpp:434] conv0/relu <- conv0
I0504 06:37:33.898712 50237 net.cpp:395] conv0/relu -> conv0 (in-place)
I0504 06:37:33.898746 50237 net.cpp:150] Setting up conv0/relu
I0504 06:37:33.898777 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.898805 50237 net.cpp:165] Memory required for data: 483840032
I0504 06:37:33.898833 50237 layer_factory.hpp:77] Creating layer conv1/dw
I0504 06:37:33.898864 50237 net.cpp:100] Creating Layer conv1/dw
I0504 06:37:33.898893 50237 net.cpp:434] conv1/dw <- conv0
I0504 06:37:33.898922 50237 net.cpp:408] conv1/dw -> conv1/dw
I0504 06:37:33.898962 50237 net.cpp:150] Setting up conv1/dw
I0504 06:37:33.898993 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.899021 50237 net.cpp:165] Memory required for data: 552960032
I0504 06:37:33.899050 50237 layer_factory.hpp:77] Creating layer conv1/dw/bn
[...]
I0504 06:37:33.985625 50237 layer_factory.hpp:77] Creating layer conv13/dw/scale
I0504 06:37:33.985718 50237 net.cpp:100] Creating Layer conv13/dw/scale
@ 0x7f192267c2c0 caffe::GenerateBatchSamples()
I0504 06:37:33.987087 50237 net.cpp:434] conv13/dw/scale <- conv13/dw
I0504 06:37:33.987202 50237 net.cpp:395] conv13/dw/scale -> conv13/dw (in-place)
I0504 06:37:33.987262 50237 layer_factory.hpp:77] Creating layer conv13/dw/scale
I0504 06:37:33.987337 50237 net.cpp:150] Setting up conv13/dw/scale
I0504 06:37:33.987366 50237 net.cpp:157] Top shape: 24 1024 10 10 (2457600)
I0504 06:37:33.987393 50237 net.cpp:165] Memory required for data: 3753455648
I0504 06:37:33.987419 50237 layer_factory.hpp:77] Creating layer conv13/dw/relu
I0504 06:37:33.987447 50237 net.cpp:100] Creating Layer conv13/dw/relu
I0504 06:37:33.987470 50237 net.cpp:434] conv13/dw/relu <- conv13/dw
I0504 06:37:33.987504 50237 net.cpp:395] conv13/dw/relu -> conv13/dw (in-place)
I0504 06:37:33.987534 50237 net.cpp:150] Setting up conv13/dw/relu
I0504 06:37:33.987557 50237 net.cpp:157] Top shape: 24 1024 10 10 (2457600)
I0504 06:37:33.987582 50237 net.cpp:165] Memory required for data: 3763286048
I0504 06:37:33.987607 50237 layer_factory.hpp:77] Creating layer conv13
I0504 06:37:33.987639 50237 net.cpp:100] Creating Layer conv13
I0504 06:37:33.987665 50237 net.cpp:434] conv13 <- conv13/dw
I0504 06:37:33.987691 50237 net.cpp:408] conv13 -> conv13
@ 0x7f19226dc732 caffe::AnnotatedDataLayer<>::load_batch()
@ 0x7f19226e000a caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
@ 0x7f191ec9fbcd (unknown)
@ 0x7f191c4326db start_thread
@ 0x7f19210eb88f clone
Aborted (core dumped)
I suspect this could be a memory problem, since it fails while building the conv layers (I am training on CPU), but my batch size is already only 24. Does anyone know what exactly causes this problem and how to fix it?
Thanks!
After spending a lot of time on this problem and trying countless solutions, I finally found what causes it. This bug is especially nasty because in most cases it simply produces no error message at all.
See the original thread here: https://github.com/weiliu89/caffe/issues/669#issuecomment-339542120
You have to edit the source code slightly before compiling. Go to caffe/src/caffe/util/math_functions.cpp, find this function around line 247, and edit it so that it looks like this:
template <typename Dtype>
void caffe_rng_uniform(const int n, Dtype a, Dtype b, Dtype* r) {
  CHECK_GE(n, 0);
  CHECK(r);
  if (a > b) {  // swap the bounds if they arrive inverted
    Dtype c = a;
    a = b;
    b = c;
  }
  CHECK_LE(a, b);
  boost::uniform_real<Dtype> random_distribution(a, caffe_nextafter<Dtype>(b));
  boost::variate_generator<caffe::rng_t*, boost::uniform_real<Dtype> >
      variate_generator(caffe_rng(), random_distribution);
  for (int i = 0; i < n; ++i) {
    r[i] = variate_generator();
  }
}
Note that I just added an if statement (which swaps the variables a and b when a is greater than b) and removed the const qualifier from the Dtype a and Dtype b parameters.
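To see why inverted bounds kill the process with no useful message: CHECK_LE(a, b) (and boost's own assertion inside uniform_real) simply aborts when the batch sampler hands in a range with a > b. As an illustration only, here is the same guard sketched in plain C++, with the standard library standing in for boost and all names my own, not Caffe's:

```cpp
#include <algorithm>
#include <cassert>
#include <random>

// Illustration only (not Caffe code): std::uniform_real_distribution
// stands in for boost::uniform_real. The essential part is the guard
// that swaps the bounds when a > b, so the distribution never sees an
// inverted (and therefore invalid) range.
template <typename Dtype>
void rng_uniform_safe(const int n, Dtype a, Dtype b, Dtype* r) {
  assert(n >= 0);
  assert(r != nullptr);
  if (a > b) std::swap(a, b);  // the fix: tolerate inverted bounds
  std::mt19937 gen(42);        // fixed seed, for reproducibility
  std::uniform_real_distribution<Dtype> dist(a, b);
  for (int i = 0; i < n; ++i) {
    r[i] = dist(gen);
  }
}
```

With the guard in place, a call like rng_uniform_safe(5, 0.9f, 0.3f, out) fills out with values in [0.3, 0.9] instead of tripping an assertion.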
Then simply run:
make clean
make -j$(nproc)
make py -j$(nproc)
make test -j$(nproc)
make runtest -j$(nproc) # You should run the tests after compiling to make sure you don't run into any other unexpected error.
For me, this did the trick!