
Threshold value for sigmoid nonlinearity for multilabel classification

I am trying to use the DenseNet architecture to classify x-ray images from https://www.kaggle.com/nih-chest-xrays/data. The model produces a binary label vector in which each entry indicates the presence or absence of one of 14 possible pathologies: Atelectasis, Cardiomegaly, Consolidation, Edema, Effusion, Emphysema, Fibrosis, Hernia, Infiltration, Mass, Nodule, Pleural Thickening, Pneumonia, and Pneumothorax. For example, a healthy patient has the label [0,0,0,0,0,0,0,0,0,0,0,0,0,0], while a patient with edema and effusion has the label [0,0,0,1,1,0,0,0,0,0,0,0,0,0]. I built the model with TensorFlow, and since this is a multilabel classification problem, the cost function I use is tf.reduce_mean(tf.losses.sigmoid_cross_entropy(labels, logits)), minimized with AdamOptimizer. However, when I inspect the sigmoid outputs, the values are all below 0.5, so tf.round(logits) produces zeros for every prediction. The actual logits differ across inputs and are non-zero after 10,000 iterations, so I don't think vanishing gradients are the problem. I have two questions:

  1. Could this problem be caused by an incorrect implementation of the model?
  2. Would it be "cheating" to lower the threshold of the sigmoid function from 0.5 to 0.25 in order to improve model accuracy?

Thanks.

Here is the code for the model:

import tensorflow as tf
# `layers`, `db_121` and `db_169` are defined elsewhere in the project

def DenseNet(features, labels, mode, params):

    depth = params["depth"]
    k = params["growth"]

    if depth == 121:
        N = db_121
    else:
        N = db_169

    bottleneck_output = 4 * k

    # before entering the first dense block, a conv operation with 16 output channels
    # is performed on the input images

    with tf.variable_scope('input_layer'):
        #l = tf.reshape(features, [-1, 224, 224, 1])
        feature_maps = 2 * k
        l = layers.conv(features, filter_size = 7, stride = 2, out_chn = feature_maps)
        l = tf.nn.max_pool(l,
                           padding='SAME',
                           ksize=[1,3,3,1],
                           strides=[1,2,2,1],
                           name='max_pool')

    # each block is defined as a dense block + transition layer
    with tf.variable_scope('block1'):
        for i in range(N[0]):
            with tf.variable_scope('bottleneck_layer.{}'.format(i+1)):
                bn_l = layers.batch_norm('BN', l)
                bn_l = tf.nn.relu(bn_l, name='relu')
                bn_l = layers.conv(bn_l, out_chn=bottleneck_output, filter_size=1)
            l = layers.add_layer('dense_layer.{}'.format(i+1), l, bn_l)
        l = layers.transition_layer('transition1', l)

    with tf.variable_scope('block2'):
        for i in range(N[1]):
            with tf.variable_scope('bottleneck_layer.{}'.format(i+1)):
                bn_l = layers.batch_norm('BN', l)
                bn_l = tf.nn.relu(bn_l, name='relu')
                bn_l = layers.conv(bn_l, out_chn=bottleneck_output, filter_size=1)
            l = layers.add_layer('dense_layer.{}'.format(i+1), l, bn_l)
        l = layers.transition_layer('transition2', l)

    with tf.variable_scope('block3'):
        for i in range(N[2]):
            with tf.variable_scope('bottleneck_layer.{}'.format(i+1)):
                bn_l = layers.batch_norm('BN', l)
                bn_l = tf.nn.relu(bn_l, name='relu')
                bn_l = layers.conv(bn_l, out_chn=bottleneck_output, filter_size=1)
            l = layers.add_layer('dense_layer.{}'.format(i+1), l, bn_l)
        l = layers.transition_layer('transition3', l)

    # the last block does not have a transition layer
    with tf.variable_scope('block4'):
        for i in range(N[3]):
            with tf.variable_scope('bottleneck_layer.{}'.format(i+1)):
                bn_l = layers.batch_norm('BN', l)
                bn_l = tf.nn.relu(bn_l, name='relu')
                bn_l = layers.conv(bn_l, out_chn=bottleneck_output, filter_size=1)
            l = layers.add_layer('dense_layer.{}'.format(i+1), l, bn_l)

    # classification head (global pooling followed by fully connected layers producing 14 logits)
    with tf.name_scope('classification'):
        l = layers.batch_norm('BN', l)
        l = tf.nn.relu(l, name='relu')
        l = layers.pooling(l, filter_size = 7)
        l_shape = l.get_shape().as_list()
        l = tf.reshape(l, [-1, l_shape[1] * l_shape[2] * l_shape[3]])
        l = tf.layers.dense(l, units = 1000, activation = tf.nn.relu, name='fc1', kernel_initializer=tf.contrib.layers.xavier_initializer())
        output = tf.layers.dense(l, units = 14, name='fc2', kernel_initializer=tf.contrib.layers.xavier_initializer()) # [batch_size, 14]

    cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=output) # cost function
    cost = tf.reduce_mean(cross_entropy, name='cost_fn')
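
Not shown above: the cost is minimized with AdamOptimizer and the predictions come from rounding the sigmoid outputs, roughly along these lines (simplified, with a placeholder learning rate):

train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(cost)

probs = tf.sigmoid(output)       # per-pathology probabilities, shape [batch_size, 14]
predictions = tf.round(probs)    # 0/1 vector; all zeros whenever every probability is below 0.5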

First, let me repeat the comment I left, in case this answer ends up working for you (and perhaps others):

I think you are on the right path, but you might be thinking about the problem in the wrong way. Might it be the case that the positives (1s) are far less frequent than the negatives (0s)? Based on your loss function, think about what that imbalance might drive a sigmoid layer to do (intuitively, would it be a better bet for the model to guess all 1s or all 0s?). I think you are on the right track. Think precision, recall, and what you actually want the model to do. Happy to write a full answer if that doesn't lead you in the right direction.
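
To make the imbalance concrete, it is worth simply measuring it. Here is a quick numpy sketch (train_labels is a name I am making up for your [num_examples, 14] array of training labels):

import numpy as np

# train_labels: 0/1 array of shape [num_examples, 14], one column per pathology
pos_rate = train_labels.mean(axis=0)           # fraction of positives per class
neg_pos_ratio = (1.0 - pos_rate) / pos_rate    # negatives per positive, per class
print(pos_rate)
print(neg_pos_ratio)

If most of those ratios are much larger than 1, the all-zeros behaviour you are seeing is roughly what an unweighted sigmoid cross-entropy rewards.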

Your question is a bit tricky to answer because I don't know the full context of how the predictions relate to one another (whether the predicted classes are independent, highly correlated, etc.). You will also have to make a call about the relative value of precision versus recall (do you consider a false positive worse than a false negative, or are they equally bad?). For an initial pass, I think it would be worth trying weighted_cross_entropy_with_logits. You can weight the model's positive and negative decisions according to whatever heuristic guides your precision/recall decision (on medical data, I would guess that a false negative is a very bad thing).
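
A minimal sketch of that swap in your graph (the pos_weight values below are placeholders rather than tuned numbers; a value above 1 penalizes false negatives more heavily):

# one pos_weight per pathology; broadcasts against the [batch_size, 14] logits
pos_weight = tf.constant([10.0] * 14)

cross_entropy = tf.nn.weighted_cross_entropy_with_logits(
    targets=labels,        # same label tensor as before
    logits=output,         # raw logits from 'fc2'
    pos_weight=pos_weight)
cost = tf.reduce_mean(cross_entropy, name='cost_fn')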

This answer is based on a 1,000-foot view of your problem, so if it doesn't work well for you, feel free to revise my answer! If you are after raw accuracy (at the expense of the precision/recall balance), it may be worth trying to show that the class frequencies in your training set approximate those in your test set (and then weighting the individual predictions to match). As long as it is implemented carefully, your thresholding idea is perfectly legitimate (just don't share frequency information between training and test, and so on).
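
As a concrete (made-up) sketch of the frequency-matching idea: suppose val_probs holds the sigmoid outputs on a held-out validation split and train_freq holds the per-class positive rate measured on the training labels only. You could then pick one threshold per class like this:

import numpy as np

def per_class_thresholds(val_probs, train_freq):
    # choose, for each class, the cutoff whose predicted-positive rate on the
    # validation split matches that class's frequency in the training set
    thresholds = np.zeros(val_probs.shape[1])
    for c in range(val_probs.shape[1]):
        thresholds[c] = np.quantile(val_probs[:, c], 1.0 - train_freq[c])
    return thresholds

# at test time: preds = (test_probs > thresholds).astype(int)

The key point, as above, is that train_freq never touches the test set.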

Edit: in case it isn't clear from the documentation, this section should help guide you in constructing a custom loss function (if appropriate)! In the derivation below, x is the logit, z is the 0/1 target, and q is the pos_weight applied to positive examples:

  qz * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
= qz * -log(1 / (1 + exp(-x))) + (1 - z) * -log(exp(-x) / (1 + exp(-x)))
= qz * log(1 + exp(-x)) + (1 - z) * (-log(exp(-x)) + log(1 + exp(-x)))
= qz * log(1 + exp(-x)) + (1 - z) * (x + log(1 + exp(-x)))
= (1 - z) * x + (qz +  1 - z) * log(1 + exp(-x))
= (1 - z) * x + (1 + (q - 1) * z) * log(1 + exp(-x))

Setting l = (1 + (q - 1) * z), the implementation uses the following numerically stable form to avoid overflow in exp(-x):

  (1 - z) * x + l * (log(1 + exp(-abs(x))) + max(-x, 0))
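
For completeness, here is a direct numpy transcription of that stable form (the function name is mine), in case you want to prototype a custom weighting scheme outside the graph first:

import numpy as np

def weighted_sigmoid_cross_entropy(x, z, q):
    # stable version of: q * z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
    l = 1.0 + (q - 1.0) * z
    return (1.0 - z) * x + l * (np.log1p(np.exp(-np.abs(x))) + np.maximum(-x, 0.0))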