TensorFlow 中损失函数 (MLP) 的奇怪 NaN 值

Strange NaN values for loss function (MLP) in TensorFlow

希望你能帮助我。我正在使用 TensorFlow 和我在互联网上找到的一些教程来实现一个小型多层感知器。问题是网络能够学习一些东西,我的意思是我能够以某种方式优化训练误差的值并获得不错的准确性,这就是我的目标。但是,我正在使用 Tensorboard 为损失函数记录一些奇怪的 NaN 值。实际上相当多。在这里你可以看到我最新的损失函数输出的 Tensorboard 记录。请所有这些三角形后接不连续点 - 这些是 NaN 值,还要注意函数的总体趋势是您所期望的。

Tensorboard 报告

我认为高学习率可能是问题所在,或者网络太深,导致梯度爆炸,所以我降低了学习率并使用了一个隐藏层(这是上图和下面的代码)。什么都没有改变,我只是导致学习过程变慢了。

Tensorflow 代码

import tensorflow as tf
import numpy as np
import scipy.io, sys, time
from numpy import genfromtxt
from random import shuffle

#shuffles two related lists #TODO check that the two lists have same size
def shuffle_examples(examples, labels):
    examples_shuffled = []
    labels_shuffled = []
    indexes = list(range(len(examples)))
    shuffle(indexes)
    for i in indexes:
        examples_shuffled.append(examples[i])
        labels_shuffled.append(labels[i])
    examples_shuffled = np.asarray(examples_shuffled)
    labels_shuffled = np.asarray(labels_shuffled)
return examples_shuffled, labels_shuffled

# Import and transform dataset
dataset = scipy.io.mmread(sys.argv[1])
dataset = dataset.astype(np.float32)

all_labels  = genfromtxt('oh_labels.csv', delimiter=',')
num_examples    = all_labels.shape[0]

dataset, all_labels = shuffle_examples(dataset, all_labels)

# Split dataset into training (66%) and test (33%) set
training_set_size = 2000
training_set = dataset[0:training_set_size]
training_labels = all_labels[0:training_set_size]
test_set = dataset[training_set_size:num_examples]
test_labels = all_labels[training_set_size:num_examples]

test_set, test_labels = shuffle_examples(test_set, test_labels)

# Parameters
learning_rate = 0.0001
training_epochs = 150
mini_batch_size = 100
total_batch = int(num_examples/mini_batch_size)

# Network Parameters
n_hidden_1 = 50 # 1st hidden layer of neurons
#n_hidden_2 = 16 # 2nd hidden layer of neurons
n_input = int(sys.argv[2]) # number of features after LSA
n_classes = 2;

# Tensorflow Graph input
with tf.name_scope("input"):
    x = tf.placeholder(np.float32, shape=[None, n_input], name="x-data")
    y = tf.placeholder(np.float32, shape=[None, n_classes], name="y-labels")

print("Creating model.")

# Create model
def multilayer_perceptron(x, weights, biases):
    with tf.name_scope("h_layer_1"):
        # First hidden layer with SIGMOID activation
        layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
        layer_1 = tf.nn.sigmoid(layer_1)
    #with tf.name_scope("h_layer_2"):
        # Second hidden layer with SIGMOID activation
        #layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
        #layer_2 = tf.nn.sigmoid(layer_2)
    with tf.name_scope("out_layer"):
        # Output layer with SIGMOID activation
        out_layer = tf.add(tf.matmul(layer_1, weights['out']), biases['bout'])
        out_layer = tf.nn.sigmoid(out_layer)
    return out_layer

# Layer weights
with tf.name_scope("weights"):
    weights = {
        'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1], stddev=0.01, dtype=np.float32)),      
        #'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2], stddev=0.05, dtype=np.float32)),
        'out': tf.Variable(tf.random_normal([n_hidden_1, n_classes], stddev=0.01, dtype=np.float32))
}

# Layer biases 
with tf.name_scope("biases"):
    biases = {
        'b1': tf.Variable(tf.random_normal([n_hidden_1], dtype=np.float32)),
        #'b2': tf.Variable(tf.random_normal([n_hidden_2], dtype=np.float32)),
        'bout': tf.Variable(tf.random_normal([n_classes], dtype=np.float32))
    }

# Construct model
pred = multilayer_perceptron(x, weights, biases)

# Define loss and optimizer
with tf.name_scope("loss"):
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
with tf.name_scope("adam"):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

# Initializing the variables
init = tf.initialize_all_variables()

# Define summaries
tf.scalar_summary("loss", cost)
summary_op = tf.merge_all_summaries()

print("Model ready.")

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    board_path = sys.argv[3]+time.strftime("%Y%m%d%H%M%S")+"/"
    writer = tf.train.SummaryWriter(board_path, graph=tf.get_default_graph())   

    print("Starting Training.")
    for epoch in range(training_epochs):
        training_set, training_labels = shuffle_examples(training_set, training_labels)
        for i in range(total_batch):
            # example loading
            minibatch_x = training_set[i*mini_batch_size:(i+1)*mini_batch_size]
            minibatch_y = training_labels[i*mini_batch_size:(i+1)*mini_batch_size]
            # Run optimization op (backprop) and cost op
            _, summary = sess.run([optimizer, summary_op], feed_dict={x: minibatch_x, y: minibatch_y})
            # Write log
            writer.add_summary(summary, epoch*total_batch+i)

    print("Optimization Finished!")

# Test model
test_error = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
accuracy = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(accuracy, np.float32))
test_error, accuracy = sess.run([test_error, accuracy], feed_dict={x: test_set, y: test_labels})
print("Test Error: " + test_error.__str__() + "; Accuracy: " + accuracy.__str__())
print("Tensorboard path: " + board_path)

我将 post 解决方案放在这里,以防万一有人遇到类似的问题。如果你非常仔细地看这个图,所有的 NaN 值(三角形)都会定期出现,就像在每个循环结束时某些东西导致损失函数的输出只是 NaN 一样。 问题是,在每个循环中,我都给出了一小批 "empty" 个示例。问题在于我如何声明我的内部训练循环:

for i in range(total_batch):

现在我们想要的是让 Tensorflow 遍历整个训练集,一次一个小批量。那么让我们看看 total_batch 是如何声明的:

total_batch = int(num_examples / mini_batch_size)

这不是我们想要做的 - 因为我们只想 考虑训练集。所以将此行更改为:

total_batch = int(training_set_size / mini_batch_size)

已解决问题。 需要注意的是,Tensorflow 似乎忽略了那些 "empty" 批次,计算损失的 NaN 但不更新梯度 - 这就是为什么损失的趋势是正在学习某些东西的网络之一。