Caffe multi-label training with lmdb to classify facial regions

Question

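I am using two lmdb inputs to identify the eye, nose-tip, and mouth regions of a face. The data lmdb has dimensions Nx3xHxW, while the label lmdb has dimensions Nx1xH/4xW/4. Each label image is created by masking regions with the numbers 1-4 on an opencv Mat that was initialized to all 0s (so there are 5 labels in total, with 0 being the background label). I scaled the label images down to 1/4 of the width and height of the corresponding image because my net has 2 pooling layers. This downscaling ensures that the label image dimensions match the output of the last convolution layer.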
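For reference, the label preparation described above can be sketched roughly as follows. This is only a numpy illustration, not the asker's actual OpenCV code: the function names and the box coordinates are made up for the example. The one point it is meant to show is that the downscaling should be nearest-neighbour (here plain striding; `cv2.resize` with `cv2.INTER_NEAREST` would be the OpenCV analogue), since an interpolating resize could blend label ids into values that belong to no class.

```python
import numpy as np

def make_label_mask(height, width, regions):
    """Build a single-channel label image: 0 = background, 1..4 = facial regions.

    `regions` maps a label id to a (top, left, bottom, right) box. The boxes
    are hypothetical stand-ins for whatever masks were drawn on the Mat.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for label, (top, left, bottom, right) in regions.items():
        mask[top:bottom, left:right] = label
    return mask

def downscale_labels(mask, factor=4):
    """Nearest-neighbour downscale by striding, so no new label values appear."""
    return mask[::factor, ::factor]

# Hypothetical region boxes for a 64x64 face image.
regions = {
    1: (16, 8, 24, 24),   # e.g. left eye
    2: (16, 40, 24, 56),  # e.g. right eye
    3: (28, 28, 36, 36),  # e.g. nose tip
    4: (44, 16, 52, 48),  # e.g. mouth
}
full = make_label_mask(64, 64, regions)
small = downscale_labels(full)
label_blob = small[np.newaxis, :, :]  # 1 x H/4 x W/4, one entry of the label lmdb
```

Each `label_blob` would then be written to the label lmdb in the same order as the corresponding image in the data lmdb.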

My train_val.prototxt:

name: "facial_keypoints"
layer {
  name: "images"
  type: "Data"
  top: "images"
  include {
    phase: TRAIN
  }
  transform_param {
    mean_file: "../mean.binaryproto"
  }
  data_param {
    source: "../train_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "labels"
  type: "Data"
  top: "labels"
  include {
    phase: TRAIN
  }
  data_param {
    source: "../train_label_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "images"
  type: "Data"
  top: "images"
  include {
    phase: TEST
  }
  transform_param {
    mean_file: "../mean.binaryproto"
  }
  data_param {
    source: "../test_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "labels"
  type: "Data"
  top: "labels"
  include {
    phase: TEST
  }
  data_param {
    source: "../test_label_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "images"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 32
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "gaussian"
      std: 0.0001
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "pool1"
  top: "pool1"
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 64
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu2"
  type: "ReLU"
  bottom: "conv2"
  top: "conv2"
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: AVE
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "conv_last"
  type: "Convolution"
  bottom: "pool2"
  top: "conv_last"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 5
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      #type: "xavier"
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu2"
  type: "ReLU"
  bottom: "conv_last"
  top: "conv_last"
}

layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "conv_last"
  bottom: "labels"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "conv_last"
  bottom: "labels"
  top: "loss"
}
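
As a sanity check on the H/4 x W/4 label size, the spatial dimensions can be traced through the layers above. The sketch below assumes Caffe's usual output-size rules (convolution rounds down, pooling rounds up, and pooling's extra clipping step only applies when padding is used); the helper names are made up for this example.

```python
import math

def conv_out(size, kernel, pad, stride):
    # Caffe convolution rounds the output size down.
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride, pad=0):
    # Caffe pooling rounds the output size up (no clipping needed when pad == 0).
    return int(math.ceil((size + 2 * pad - kernel) / float(stride))) + 1

def net_output_size(size):
    """Trace one spatial dimension through the layers of the prototxt above."""
    size = conv_out(size, kernel=5, pad=2, stride=1)  # conv1: size preserved
    size = pool_out(size, kernel=3, stride=2)         # pool1: roughly halved
    size = conv_out(size, kernel=5, pad=2, stride=1)  # conv2: size preserved
    size = pool_out(size, kernel=3, stride=2)         # pool2: roughly halved
    size = conv_out(size, kernel=5, pad=2, stride=1)  # conv_last: size preserved
    return size

height_out = net_output_size(64)
width_out = net_output_size(48)
```

For input sizes divisible by 4 this gives exactly size/4, which agrees with the label dimensions in the question. For other sizes the rounding in the pooling layers should be checked explicitly.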

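In the last convolution layer I set the output size to 5 because I have 5 label classes. Training converged with a final loss of about 0.3 and an accuracy of 0.9 (although some sources suggest that this accuracy is not measured correctly for multi-label problems). When I use the trained model, the output layer correctly produces a blob of dimensions 1x5xH/4xW/4, which I managed to visualize as 5 separate single-channel images. However, while the first image correctly highlights the background pixels, the remaining 4 images look almost identical, with all 4 regions highlighted in each of them.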
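
A side note on the visualization: "conv_last" holds raw per-class scores, and "SoftmaxWithLoss" normalizes them across the channel axis internally. When comparing channels it may therefore be more informative to look at the per-pixel softmax probabilities, or simply at the per-pixel argmax. A minimal numpy sketch, assuming the output blob has already been fetched as an array of shape (1, 5, H/4, W/4) (the random array below is only a placeholder for it):

```python
import numpy as np

def scores_to_label_map(scores):
    """Turn a (1, C, H, W) score blob into per-pixel probabilities and class ids."""
    s = scores[0]                              # (C, H, W)
    s = s - s.max(axis=0, keepdims=True)       # subtract the max for numerical stability
    e = np.exp(s)
    probs = e / e.sum(axis=0, keepdims=True)   # softmax over the channel axis
    label_map = probs.argmax(axis=0)           # (H, W), values in 0..C-1
    return probs, label_map

# Placeholder data standing in for the network output.
scores = np.random.RandomState(0).randn(1, 5, 16, 16)
probs, label_map = scores_to_label_map(scores)
```

`label_map` can then be compared directly against the downscaled label image.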

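Visualization of the 5 output channels (intensity increases from blue to red):

Original image (the concentric circles mark the highest intensity of each channel; some are drawn larger only to distinguish them from the others. As you can see, apart from the background, the remaining channels all have their highest activation at almost the same spot in the mouth region, which should not be the case.)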

Can anybody help me spot the mistake I made?

Thanks.

Answer 1

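It seems you are facing class imbalance: most of your labeled pixels are labeled 0 (background), so during training the net learns to predict background almost regardless of what it "sees". Since predicting background is correct most of the time, the training loss decreases and the accuracy increases up to a certain point.
However, when you actually try to visualize the output prediction, it is mostly background, with very little information about the other, scarce labels.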
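
To make this concrete, here is a small numpy sketch with made-up numbers: a 16x16 label map in which about 90% of the pixels are background, and a "prediction" that is background everywhere. The overall pixel accuracy looks good, while the per-class recall of every facial region is zero.

```python
import numpy as np

def pixel_accuracy(pred, labels):
    """Fraction of pixels whose predicted class equals the label."""
    return float((pred == labels).mean())

def per_class_recall(pred, labels, num_classes=5):
    """For each class present in `labels`, the fraction of its pixels predicted correctly."""
    recalls = {}
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            recalls[c] = float((pred[mask] == c).mean())
    return recalls

# Synthetic label map: 4 small regions (6 pixels each), everything else background.
labels = np.zeros((16, 16), dtype=np.int64)
labels[4:6, 2:5] = 1
labels[4:6, 10:13] = 2
labels[8:10, 6:9] = 3
labels[12:14, 5:8] = 4

pred = np.zeros_like(labels)            # a net that always answers "background"

acc = pixel_accuracy(pred, labels)      # 232 / 256, about 0.906
recall = per_class_recall(pred, labels) # 1.0 for class 0, 0.0 for classes 1..4
```

This is why an accuracy of about 0.9 by itself says little here; per-class numbers are more telling.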

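One way of tackling class imbalance in caffe is to use a loss layer whose weights are tuned to counteract the imbalance of the labels (for example, the "InfogainLoss" layer).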
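
As an illustration of how such weights could be chosen, the sketch below builds a diagonal weight matrix from inverse class frequencies, which is the kind of matrix Caffe's "InfogainLoss" layer takes as its H. The weighting scheme (1 / (num_classes * frequency), so that a perfectly balanced dataset gets weight 1 for every class) is only one reasonable choice and is an assumption of this example, as is the synthetic label map.

```python
import numpy as np

def infogain_matrix(labels, num_classes=5):
    """Diagonal weight matrix with inverse-frequency weights per class."""
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(np.float64)
    freq = counts / counts.sum()
    weights = np.zeros(num_classes, dtype=np.float64)
    present = counts > 0                      # classes that never occur keep weight 0
    weights[present] = 1.0 / (num_classes * freq[present])
    return np.diag(weights).astype(np.float32)

# Synthetic label map dominated by background, with four small regions.
labels = np.zeros((16, 16), dtype=np.int64)
labels[4:6, 2:5] = 1
labels[4:6, 10:13] = 2
labels[8:10, 6:9] = 3
labels[12:14, 5:8] = 4

H = infogain_matrix(labels)   # background gets a small weight, rare classes a large one
```

In practice the counts would be accumulated over the whole training label lmdb rather than a single image, and the matrix would then be saved as a binaryproto file and referenced from the loss layer's `infogain_loss_param`. Depending on the Caffe version, "InfogainLoss" may expect probabilities as input, in which case a "Softmax" layer is needed in front of it.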
