Caffe - network not learning

I am unable to train a VGG NET model from scratch. Allow me to describe the steps I have taken so far:


This is what my network prints out:

blobs ['data', 'conv1', 'norm1', 'pool1', 'conv2', 'pool2', 'conv3', 'conv4', 'conv5', 'pool5', 'fc6', 'fc7', 'fc8', 'prob']
params ['conv1', 'conv2', 'conv3', 'conv4', 'conv5', 'fc6', 'fc7', 'fc8_cat']

  Layer Name :   conv1, Weight Dims :(96, 3, 7, 7) 
  Layer Name :   conv2, Weight Dims :(256, 96, 5, 5) 
  Layer Name :   conv3, Weight Dims :(512, 256, 3, 3) 
  Layer Name :   conv4, Weight Dims :(512, 512, 3, 3) 
  Layer Name :   conv5, Weight Dims :(512, 512, 3, 3) 
  Layer Name :     fc6, Weight Dims :(4048, 25088) 
  Layer Name :     fc7, Weight Dims :(4048, 4048) 
  Layer Name : fc8_cat, Weight Dims :(6, 4048) 

fc6 weights are (4048, 25088) dimensional and biases are (4048,) dimensional
fc7 weights are (4048, 4048) dimensional and biases are (4048,) dimensional
fc8_cat weights are (6, 4048) dimensional and biases are (6,) dimensional
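
For context, a listing like the one above can be produced with a few lines of pycaffe; the sketch below is only illustrative and assumes a deploy prototxt path, which is a placeholder here:

import caffe

# Placeholder path; adjust to wherever the deploy prototxt actually lives.
net = caffe.Net('models/Custom_Model/deploy.prototxt', caffe.TEST)

print('blobs', list(net.blobs.keys()))
print('params', list(net.params.keys()))

for name, params in net.params.items():
    # params[0] holds the weights, params[1] the biases (when the layer has them).
    print('  Layer Name : %7s, Weight Dims : %s' % (name, params[0].data.shape))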

However, something is not right, because the loss does not decrease during training. Log:

Iteration 2980 (6.05491 iter/s, 1.65155s/10 iters), loss = 1.79258
Iteration 2980, lr = 0.001
Iteration 2990 (6.28537 iter/s, 1.591s/10 iters), loss = 1.79471
Iteration 2990, lr = 0.001
Snapshotting to binary proto file /output/_iter_3000.caffemodel
Snapshotting solver state to binary proto file /output/_iter_3000.solverstate
Iteration 3000, loss = 1.7902
Iteration 3000, Testing net (#0)
Ignoring source layer training_train
Optimization Done.
Optimization Done.
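
For reference, with 6 output classes a loss that plateaus around 1.79 is exactly what a constant, uniform prediction would produce, since -ln(1/6) ≈ 1.79:

import math

# Cross-entropy loss when the network always predicts 1/6 for every class.
print(-math.log(1.0 / 6.0))   # -> 1.7917..., matching the plateaued loss above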

It seems that the initialized weights are not being propagated through the network, since the weights are all zero from conv1 onward.

   ...,


       [[[0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.]],

        ...,
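
A dump like that can be reproduced by loading the net in pycaffe and printing a layer's kernel blob directly; a minimal sketch (the path is a placeholder and the layer name is just an example):

import caffe

# Placeholder path; TEST phase is enough to instantiate the parameter blobs.
net = caffe.Net('models/Custom_Model/train.prototxt', caffe.TEST)

# Print the raw kernel weights of one convolution layer (example layer name).
print(net.params['conv2'][0].data)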

train.prototxt

name: "CaffeNet"
layers {
  name: "training_train"
  type: DATA
  data_param {
    source: "/input/training_set_lmdb"
    backend: LMDB
    batch_size: 30
  }
  transform_param{
    mean_file: "/input/mean_training_image.binaryproto"
  }
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
}
layers {
  name: "training_test"
  type: DATA
  data_param {
    source: "/input/validation_set_lmdb"
    backend: LMDB
    batch_size: 15
  }
  transform_param{
    mean_file: "/input/mean_training_image.binaryproto"
  }
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
}
layers {
  name: "conv1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 96
    kernel_size: 7
    stride: 2
    weight_filler {
        type: "gaussian" 
        std: 0.01        
      }
      bias_filler {
        type: "constant" 
        value: 0
      }
  }
  blobs_lr: 0
  blobs_lr: 0
}
layers {
  name: "relu1"
  type: RELU
  bottom: "conv1"
  top: "conv1"
}
layers {
  name: "norm1"
  type: LRN
  bottom: "conv1"
  top: "norm1"
  lrn_param {
    local_size: 5
    alpha: 0.0005
    beta: 0.75
  }
}
layers {
  name: "pool1"
  type: POOLING
  bottom: "norm1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 3
  }
}
layers {
  name: "conv2"
  type: CONVOLUTION
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    num_output: 256
    pad: 2
    kernel_size: 5
  }
  blobs_lr: 0
  blobs_lr: 0
}
layers {
  name: "relu2"
  type: RELU
  bottom: "conv2"
  top: "conv2"
}
layers {
  name: "pool2"
  type: POOLING
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layers {
  name: "conv3"
  type: CONVOLUTION
  bottom: "pool2"
  top: "conv3"
  convolution_param {
    num_output: 512
    pad: 1
    kernel_size: 3
  }
  blobs_lr: 0
  blobs_lr: 0
}
layers {
  name: "relu3"
  type: RELU
  bottom: "conv3"
  top: "conv3"
}
layers {
  name: "conv4"
  type: CONVOLUTION
  bottom: "conv3"
  top: "conv4"
  convolution_param {
    num_output: 512
    pad: 1
    kernel_size: 3
  }
  blobs_lr: 0
  blobs_lr: 0
}
layers {
  name: "relu4"
  type: RELU
  bottom: "conv4"
  top: "conv4"
}
layers {
  name: "conv5"
  type: CONVOLUTION
  bottom: "conv4"
  top: "conv5"
  convolution_param {
    num_output: 512
    pad: 1
    kernel_size: 3
  }
  blobs_lr: 0
  blobs_lr: 0
}
layers {
  name: "relu5"
  type: RELU
  bottom: "conv5"
  top: "conv5"
}
layers {
  name: "pool5"
  type: POOLING
  bottom: "conv5"
  top: "pool5"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 3
  }
}
layers {
  name: "fc6"
  type: INNER_PRODUCT
  bottom: "pool5"
  top: "fc6"
  inner_product_param {
    num_output: 4048
  }
  blobs_lr: 1.0
  blobs_lr: 1.0
}
layers {
  name: "relu6"
  type: RELU
  bottom: "fc6"
  top: "fc6"
}
layers {
  name: "drop6"
  type: DROPOUT
  bottom: "fc6"
  top: "fc6"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layers {
  name: "fc7"
  type: INNER_PRODUCT
  bottom: "fc6"
  top: "fc7"
  inner_product_param {
    num_output: 4048
  }
  blobs_lr: 1.0
  blobs_lr: 1.0
}
layers {
  name: "relu7"
  type: RELU
  bottom: "fc7"
  top: "fc7"
}
layers {
  name: "drop7"
  type: DROPOUT
  bottom: "fc7"
  top: "fc7"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layers {
  name: "fc8_cat"
  type: INNER_PRODUCT
  bottom: "fc7"
  top: "fc8"
  inner_product_param {
    num_output: 6
  }
  blobs_lr: 1.0
  blobs_lr: 1.0
}
layers {
  name: "prob"
  type: SOFTMAX_LOSS
  bottom: "fc8"
  bottom: "label"
}

deploy.prototxt

name: "VGG_FACE_16_layers"
input: "data"
input_dim: 1
input_dim: 3
input_dim: 224
input_dim: 224
layers {
  name: "conv1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 96
    kernel_size: 7
    stride: 2
    weight_filler {
        type: "gaussian" 
        std: 0.01        
      }
      bias_filler {
        type: "constant" 
        value: 0
      }
  }
}
layers {
  name: "relu1"
  type: RELU
  bottom: "conv1"
  top: "conv1"
}
layers {
  name: "norm1"
  type: LRN
  bottom: "conv1"
  top: "norm1"
  lrn_param {
    local_size: 5
    alpha: 0.0005
    beta: 0.75
  }
}
layers {
  name: "pool1"
  type: POOLING
  bottom: "norm1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 3
  }
}
layers {
  name: "conv2"
  type: CONVOLUTION
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    num_output: 256
    pad: 2
    kernel_size: 5
  }
}
layers {
  name: "relu2"
  type: RELU
  bottom: "conv2"
  top: "conv2"
}
layers {
  name: "pool2"
  type: POOLING
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layers {
  name: "conv3"
  type: CONVOLUTION
  bottom: "pool2"
  top: "conv3"
  convolution_param {
    num_output: 512
    pad: 1
    kernel_size: 3
  }
}
layers {
  name: "relu3"
  type: RELU
  bottom: "conv3"
  top: "conv3"
}
layers {
  name: "conv4"
  type: CONVOLUTION
  bottom: "conv3"
  top: "conv4"
  convolution_param {
    num_output: 512
    pad: 1
    kernel_size: 3
  }
}
layers {
  name: "relu4"
  type: RELU
  bottom: "conv4"
  top: "conv4"
}
layers {
  name: "conv5"
  type: CONVOLUTION
  bottom: "conv4"
  top: "conv5"
  convolution_param {
    num_output: 512
    pad: 1
    kernel_size: 3
  }
}
layers {
  name: "relu5"
  type: RELU
  bottom: "conv5"
  top: "conv5"
}
layers {
  name: "pool5"
  type: POOLING
  bottom: "conv5"
  top: "pool5"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 3
  }
}
layers {
  name: "fc6"
  type: INNER_PRODUCT
  bottom: "pool5"
  top: "fc6"
  inner_product_param {
    num_output: 4048
  }
}
layers {
  name: "relu6"
  type: RELU
  bottom: "fc6"
  top: "fc6"
}
layers {
  name: "drop6"
  type: DROPOUT
  bottom: "fc6"
  top: "fc6"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layers {
  name: "fc7"
  type: INNER_PRODUCT
  bottom: "fc6"
  top: "fc7"
  inner_product_param {
    num_output: 4048
  }
}
layers {
  name: "relu7"
  type: RELU
  bottom: "fc7"
  top: "fc7"
}
layers {
  name: "drop7"
  type: DROPOUT
  bottom: "fc7"
  top: "fc7"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layers {
  name: "fc8_cat"
  type: INNER_PRODUCT
  bottom: "fc7"
  top: "fc8"
  inner_product_param {
    num_output: 6
  }
}
layers {
  name: "prob"
  type: SOFTMAX
  bottom: "fc8"
  top: "prob"
}

solver.prototxt

net: "models/Custom_Model/train.prototxt"
# test_iter specifies how many forward passes the test should carry out
test_iter: 1
# Carry out testing every X training iterations
test_interval: 20
# Learning rate and momentum parameters for Adam
base_lr: 0.001
momentum: 0.9
momentum2: 0.999
# Adam takes care of changing the learning rate
lr_policy: "fixed"
# Display every X iterations
display: 10
# The maximum number of iterations
max_iter: 3000
# snapshot intermediate results
snapshot: 1000
snapshot_prefix: "/output/"
# solver mode: CPU or GPU
type: "Adam"
solver_mode: GPU
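
With this solver, launching training from pycaffe would look roughly like the following; this is a sketch under the assumption that the solver path above is used as-is:

import caffe

caffe.set_mode_gpu()  # matches solver_mode: GPU
solver = caffe.get_solver('models/Custom_Model/solver.prototxt')  # picks up type: "Adam"
solver.solve()        # runs until max_iter (3000), snapshotting every 1000 iterations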

What am I missing here?

Edit: log of the first 10 iterations:

2018-06-24 13:45:30 PSTI0624 20:45:30.842751 22 solver.cpp:218] Iteration 0 (0 iter/s, 0.345767s/10 iters), loss = 1.79176
2018-06-24 13:45:30 PSTI0624 20:45:30.842778 22 sgd_solver.cpp:105] Iteration 0, lr = 0.001
2018-06-24 13:45:32 PSTI0624 20:45:32.362357 22 net.cpp:591] [Forward] Layer training_train, top blob data data: 32.8544
2018-06-24 13:45:32 PSTI0624 20:45:32.362499 22 net.cpp:591] [Forward] Layer training_train, top blob label data: 2.46875
2018-06-24 13:45:32 PSTI0624 20:45:32.373751 22 net.cpp:591] [Forward] Layer conv1, top blob conv1 data: 3.80577
2018-06-24 13:45:32 PSTI0624 20:45:32.373879 22 net.cpp:603] [Forward] Layer conv1, param blob 0 data: 0.00792726
2018-06-24 13:45:32 PSTI0624 20:45:32.375567 22 net.cpp:603] [Forward] Layer conv1, param blob 1 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.379498 22 net.cpp:591] [Forward] Layer relu1, top blob conv1 data: 1.88893
2018-06-24 13:45:32 PSTI0624 20:45:32.382942 22 net.cpp:591] [Forward] Layer norm1, top blob norm1 data: 1.86441
2018-06-24 13:45:32 PSTI0624 20:45:32.384709 22 net.cpp:591] [Forward] Layer pool1, top blob pool1 data: 2.64384
2018-06-24 13:45:32 PSTI0624 20:45:32.407202 22 net.cpp:591] [Forward] Layer conv2, top blob conv2 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.407317 22 net.cpp:603] [Forward] Layer conv2, param blob 0 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.407389 22 net.cpp:603] [Forward] Layer conv2, param blob 1 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.408679 22 net.cpp:591] [Forward] Layer relu2, top blob conv2 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.409492 22 net.cpp:591] [Forward] Layer pool2, top blob pool2 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.422981 22 net.cpp:591] [Forward] Layer conv3, top blob conv3 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.423092 22 net.cpp:603] [Forward] Layer conv3, param blob 0 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.423151 22 net.cpp:603] [Forward] Layer conv3, param blob 1 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.423795 22 net.cpp:591] [Forward] Layer relu3, top blob conv3 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.451364 22 net.cpp:591] [Forward] Layer conv4, top blob conv4 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.451510 22 net.cpp:603] [Forward] Layer conv4, param blob 0 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.451575 22 net.cpp:603] [Forward] Layer conv4, param blob 1 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.452227 22 net.cpp:591] [Forward] Layer relu4, top blob conv4 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.479389 22 net.cpp:591] [Forward] Layer conv5, top blob conv5 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.479573 22 net.cpp:603] [Forward] Layer conv5, param blob 0 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.479640 22 net.cpp:603] [Forward] Layer conv5, param blob 1 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.480298 22 net.cpp:591] [Forward] Layer relu5, top blob conv5 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.480830 22 net.cpp:591] [Forward] Layer pool5, top blob pool5 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.487016 22 net.cpp:591] [Forward] Layer fc6, top blob fc6 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.490016 22 net.cpp:603] [Forward] Layer fc6, param blob 0 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.490097 22 net.cpp:603] [Forward] Layer fc6, param blob 1 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.490191 22 net.cpp:591] [Forward] Layer relu6, top blob fc6 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.490295 22 net.cpp:591] [Forward] Layer drop6, top blob fc6 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.491454 22 net.cpp:591] [Forward] Layer fc7, top blob fc7 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.492004 22 net.cpp:603] [Forward] Layer fc7, param blob 0 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.492074 22 net.cpp:603] [Forward] Layer fc7, param blob 1 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.492199 22 net.cpp:591] [Forward] Layer relu7, top blob fc7 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.492300 22 net.cpp:591] [Forward] Layer drop7, top blob fc7 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.492488 22 net.cpp:591] [Forward] Layer fc8_cat, top blob fc8 data: 0.00944262
2018-06-24 13:45:32 PSTI0624 20:45:32.492555 22 net.cpp:603] [Forward] Layer fc8_cat, param blob 0 data: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.492619 22 net.cpp:603] [Forward] Layer fc8_cat, param blob 1 data: 0.00944262
2018-06-24 13:45:32 PSTI0624 20:45:32.492844 22 net.cpp:591] [Forward] Layer prob, top blob (automatic) data: 1.79202
2018-06-24 13:45:32 PSTI0624 20:45:32.492954 22 net.cpp:619] [Backward] Layer prob, bottom blob fc8 diff: 0.00868093
2018-06-24 13:45:32 PSTI0624 20:45:32.493074 22 net.cpp:619] [Backward] Layer fc8_cat, bottom blob fc7 diff: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.493140 22 net.cpp:630] [Backward] Layer fc8_cat, param blob 0 diff: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.493204 22 net.cpp:630] [Backward] Layer fc8_cat, param blob 1 diff: 0.0208672
2018-06-24 13:45:32 PSTI0624 20:45:32.493306 22 net.cpp:619] [Backward] Layer drop7, bottom blob fc7 diff: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.493403 22 net.cpp:619] [Backward] Layer relu7, bottom blob fc7 diff: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.496160 22 net.cpp:619] [Backward] Layer fc7, bottom blob fc6 diff: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.496706 22 net.cpp:630] [Backward] Layer fc7, param blob 0 diff: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.496806 22 net.cpp:630] [Backward] Layer fc7, param blob 1 diff: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.496896 22 net.cpp:619] [Backward] Layer drop6, bottom blob fc6 diff: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.496992 22 net.cpp:619] [Backward] Layer relu6, bottom blob fc6 diff: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.509070 22 net.cpp:630] [Backward] Layer fc6, param blob 0 diff: 0
2018-06-24 13:45:32 PSTI0624 20:45:32.509187 22 net.cpp:630] [Backward] Layer fc6, param blob 1 diff: 0
2018-06-24 13:45:32 PSTE0624 20:45:32.526118 22 net.cpp:719] [Backward] All net params (data, diff): L1 norm = (111.926, 0.125203); L2 norm = (1.18177, 0.0578014)

Found the problem:

From the log you posted, it is quite clear where things go wrong:

...] [Forward] Layer conv1, top blob conv1 data: 3.80577     # looks good - non-zero signal 
...] [Forward] Layer conv1, param blob 0 data: 0.00792726    # non-zero kernels 
...] [Forward] Layer conv1, param blob 1 data: 0             # bias (not so important)
.
.
.
...] [Forward] Layer conv2, top blob conv2 data: 0           # no output signal !!!
...] [Forward] Layer conv2, param blob 0 data: 0             # kernels are all zero !!!
...] [Forward] Layer conv2, param blob 1 data: 0

The kernels (weights) of conv2 are all zero, therefore every blob coming out of this layer is zero, and everything from that point on is zero as well; you cannot learn anything this way.

Why did this happen?

Let's take a closer look at how conv1 (the good layer) and conv2 (the bad layer) are defined in your prototxt:

layers {
  name: "conv1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 96
    kernel_size: 7
    stride: 2
    weight_filler {
        type: "gaussian" 
        std: 0.01        
      }
      bias_filler {
        type: "constant" 
        value: 0
      }
  }
}
.
.
.
layers {
  name: "conv2"
  type: CONVOLUTION
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    num_output: 256
    pad: 2
    kernel_size: 5
  }
}

Can you spot the difference?
While conv1 defines a weight_filler of type: "gaussian", conv2 has no weight_filler at all! By default, Caffe initializes the kernels/weights of conv2 to zero, and from that point on everything goes south...
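
In other words, give conv2 (and every other learnable layer that currently has no filler) an explicit weight_filler and bias_filler, just like conv1 has. As a quick sanity check before training, you can load the net in pycaffe and flag any parameter blob that is still all zero; a minimal sketch (the path is a placeholder):

import caffe
import numpy as np

net = caffe.Net('models/Custom_Model/train.prototxt', caffe.TRAIN)

for name, params in net.params.items():
    for i, blob in enumerate(params):
        if not np.any(blob.data):
            # All-zero weights (i == 0) usually mean a missing weight_filler;
            # an all-zero bias (i == 1) is expected with a constant-0 bias_filler.
            print('layer %s, param blob %d is all zero' % (name, i))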