Issue with fine-tuning inceptionv3 in slim tensorflow and tf record batches

Question

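I am trying to fine-tune the inceptionv3 model using the slim tensorflow library. There are a few things I cannot understand while writing the code for it. I tried reading the source code (there is no proper documentation) and figured out a few things; I am able to fine-tune the model and save checkpoints. Here are the steps I followed. 1. I created a tf.record for my training data, which works fine, and I am now reading the data with the code below.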

import tensorflow as tf
import tensorflow.contrib.slim.nets as nets
import tensorflow.contrib.slim as slim
import matplotlib.pyplot as plt
import numpy as np

# get the data and labels here

data_path = '/home/sfarkya/nvidia_challenge/datasets/detrac/train1.tfrecords'

# Training setting
num_epochs = 100
initial_learning_rate = 0.0002
learning_rate_decay_factor = 0.7
num_epochs_before_decay = 5
num_classes = 5980

# load the checkpoint
model_path = '/home/sfarkya/nvidia_challenge/datasets/detrac/inception_v3.ckpt'

# log directory
log_dir = '/home/sfarkya/nvidia_challenge/datasets/detrac/fine_tuned_model'

with tf.Session() as sess:
    feature = {'train/image': tf.FixedLenFeature([], tf.string),
               'train/label': tf.FixedLenFeature([], tf.int64)}

    # Create a list of filenames and pass it to a queue
    filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)

    # Define a reader and read the next record
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    # Decode the record read by the reader
    features = tf.parse_single_example(serialized_example, features=feature)

    # Convert the image data from string back to the numbers
    image = tf.decode_raw(features['train/image'], tf.float32)

    # Cast label data into int32
    label = tf.cast(features['train/label'], tf.int32)

    # Reshape image data into the original shape
    image = tf.reshape(image, [128, 128, 3])

    # Creates batches by randomly shuffling tensors
    images, labels = tf.train.shuffle_batch([image, label], batch_size=64, capacity=128, num_threads=2,
                                            min_after_dequeue=64)

Now I am fine-tuning the model using slim; this is the code.

    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
    sess.run(init_op)

    # Create a coordinator and run all QueueRunner objects
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    # load model

    # load the inception model from the slim library - we are using inception v3
    #inputL = tf.placeholder(tf.float32, (64, 128, 128, 3))

    img, lbl = sess.run([images, labels])
    one_hot_labels = slim.one_hot_encoding(lbl, num_classes)

    with slim.arg_scope(slim.nets.inception.inception_v3_arg_scope()):
        logits, inceptionv3 = nets.inception.inception_v3(inputs=img, num_classes=5980, is_training=True,
                                                          dropout_keep_prob=.6)

    # Restore convolutional layers:

    variables_to_restore = slim.get_variables_to_restore(exclude=['InceptionV3/Logits', 'InceptionV3/AuxLogits'])
    init_fn = slim.assign_from_checkpoint_fn(model_path, variables_to_restore)

    # loss function
    loss = tf.losses.softmax_cross_entropy(onehot_labels=one_hot_labels, logits = logits)
    total_loss = tf.losses.get_total_loss()

    # train operation
    train_op = slim.learning.create_train_op(total_loss + loss, optimizer= tf.train.AdamOptimizer(learning_rate=1e-4))

    print('Im here')
    # Start training.
    slim.learning.train(train_op, log_dir, init_fn=init_fn, save_interval_secs=20, number_of_steps= 10)

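Now I have a few questions about the code that I cannot figure out. Once the code reaches slim.learning.train, I don't see anything printed, yet it is training, as I can see in the log. Now,
1. How do I give the number of epochs to the code? Right now it runs step by step, with batch_size = 64 in each step.
2. How do I make sure that tf.train.shuffle_batch is not repeating my images and that I am training over the whole dataset?
3. How can I print the loss values while it is training?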

Answer 1

Here are the answers to your questions.

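You cannot give epochs directly to slim.learning.train. Instead, you pass the number of batches as an argument. It is called number_of_steps. It is used to set an operation called should_stop_op on line 709. I assume you know how to convert the number of epochs into batches.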
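The conversion from epochs to steps mentioned above can be sketched as follows. This is only a minimal sketch: `num_examples` (the number of records in the TFRecord file) is an assumed value that the question does not give, and `steps_for_epochs` is a hypothetical helper name, not part of slim.

```python
import math


def steps_for_epochs(num_examples, batch_size, num_epochs):
    """Return how many training steps cover `num_epochs` passes over the data."""
    # One step consumes one batch; round up so a trailing partial batch
    # still counts as a step.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Assumed dataset size, for illustration only.
num_examples = 10000
number_of_steps = steps_for_epochs(num_examples, batch_size=64, num_epochs=100)
print(number_of_steps)
```

The resulting value would then be passed as number_of_steps to slim.learning.train.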
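I don't think the shuffle_batch function will repeat images, because internally it uses a RandomShuffleQueue. According to the documentation, the RandomShuffleQueue enqueues elements using a background thread as follows:
- While size(queue) < capacity:
  - Add an element to the queue.

It dequeues elements as follows:

- While number of elements dequeued < batch_size:
  - Wait for size(queue) >= min_after_dequeue + 1 elements.
  - Select an element from the queue uniformly at random, remove it from the queue, and add it to the output batch.

So in my opinion there is very little chance that elements are repeated, because the dequeuing operation removes the selected element from the queue. It is therefore sampling without replacement.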
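To make the enqueue and dequeue rules above concrete, here is a small pure-Python imitation of them. It is a sketch under assumptions, not TensorFlow's implementation: `simulate_shuffle_queue` is a hypothetical name, the background thread is replaced by a synchronous refill, and the leftover partial batch is kept so that every element stays visible.

```python
import random


def simulate_shuffle_queue(examples, batch_size, capacity, min_after_dequeue, seed=0):
    """Imitate the queue rules: refill up to capacity, dequeue at random."""
    # Refilling to capacity before each dequeue keeps at least
    # min_after_dequeue + 1 elements queued while the input lasts.
    assert capacity > min_after_dequeue
    rng = random.Random(seed)
    source = iter(examples)
    queue, batch, batches = [], [], []
    source_done = False
    while True:
        # Enqueue side: add elements while size(queue) < capacity.
        while not source_done and len(queue) < capacity:
            try:
                queue.append(next(source))
            except StopIteration:
                source_done = True
        if not queue:
            break
        # Dequeue side: pick one element uniformly at random and remove it,
        # so the same element can never be picked twice.
        batch.append(queue.pop(rng.randrange(len(queue))))
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)  # leftover partial batch (a simplification)
    return batches


batches = simulate_shuffle_queue(range(1000), batch_size=64, capacity=128,
                                 min_after_dequeue=64)
flat = [x for b in batches for x in b]
print(len(flat), len(set(flat)))  # the two counts match when nothing repeats
```

Because each dequeue removes the chosen element, one pass over the input yields every example exactly once, only in a locally shuffled order.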

Will a new queue be created for every epoch?

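The tensors fed into tf.train.shuffle_batch are image and label, which ultimately come from filename_queue. If that queue produces TFRecord filenames indefinitely, then I don't think shuffle_batch will create a new queue. You can also write a small toy program to understand how shuffle_batch works.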

Next, how do you train over the whole dataset? In your code, the following line gets the list of TFRecord filenames.

filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)

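If filename_queue covers all the TFRecords you have, then you are surely training over the whole dataset. Now, how to shuffle the whole dataset is another question. As @mrry has mentioned, there is (as of now, AFAIK) no support for shuffling out-of-memory datasets. So the best approach is to prepare many shards of your dataset, each containing about 1024 examples. Shuffle the list of TFRecord filenames as: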

filename_queue = tf.train.string_input_producer([data_path], shuffle=True, capacity=1000)

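Note that I removed the num_epochs=1 argument and set shuffle=True. This way it will produce the shuffled list of TFRecord filenames indefinitely. Then, if you use tf.train.shuffle_batch on each file, you will get a near-uniform shuffle. Basically, as the number of examples in each shard tends to 1, your shuffling becomes more and more uniform. I prefer not to set num_epochs and instead to terminate training using the number_of_steps argument mentioned earlier.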
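The sharding idea can be illustrated without TensorFlow. The sketch below is an assumption-laden toy: `shard_ranges` and `shuffled_epochs` are hypothetical helpers, example indices stand in for records, and a list of index ranges stands in for the shard files. It only shows that reshuffling the shard order each epoch still visits every example once per epoch.

```python
import random


def shard_ranges(num_examples, shard_size=1024):
    """Split example indices 0..num_examples-1 into consecutive shards."""
    return [range(start, min(start + shard_size, num_examples))
            for start in range(0, num_examples, shard_size)]


def shuffled_epochs(shards, num_epochs, seed=0):
    """Yield example indices, visiting the shards in a new random order each epoch."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(range(len(shards)))
        rng.shuffle(order)  # plays the role of shuffle=True on the filename queue
        for shard_id in order:
            yield from shards[shard_id]


# Assumed dataset size, for illustration only.
shards = shard_ranges(num_examples=5000, shard_size=1024)
stream = list(shuffled_epochs(shards, num_epochs=2))
print(len(shards), len(stream))
```

Here the order inside each shard is left untouched; in the real pipeline that part of the mixing would come from the queue inside tf.train.shuffle_batch, which is why smaller shards give a more uniform overall shuffle.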

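To print the loss values, you could probably just edit training.py and add logging.info('total loss = %f', total_loss). I don't know whether there is a simpler way. Another way, which does not require changing the code, is to view the summaries in Tensorboard.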

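There are very helpful articles on how to view summaries in Tensorboard, including the link at the end of this answer. Generally, you need to do the following.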

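1. Create a summary object.
2. Write the variables of interest into the summary.
3. Merge all the individual summaries.
4. Create a summary op.
5. Create a summary file writer.
6. Write the summaries throughout training at the desired frequency.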

Now, if you use slim.learning.train, steps 5 and 6 are already done automatically.

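For the first 4 steps, you can look at the file train_image_classifier.py. Line 472 shows how to create the summaries object. Lines 490, 512 and 536 write the relevant variables into summaries. Line 549 merges all the summaries, and line 553 creates an op. You can pass this op to slim.learning.train, and you can also specify how often summaries should be written.

In my opinion, do not write anything to the summaries other than loss, total_loss, accuracy and learning rate, unless you are doing specific debugging. If you write histograms, the tensorboard file can take tens of hours to load for networks like ResNet-50 (my tensorboard file was once 28 GB, and it took 12 hours to load 6 days of progress!). By the way, you can actually use the train_image_classifier.py file itself for fine-tuning, and you would skip most of the steps above. However, I prefer doing it this way, because you learn a lot.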

See the launching tensorboard section for how to view the progress in a browser.

Additional remarks:

Rather than minimizing total_loss + loss, you can do the following:

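# softmax_cross_entropy already registers its result in tf.GraphKeys.LOSSES by default,
# so an extra tf.losses.add_loss(loss) call would count this loss twice.
loss = tf.losses.softmax_cross_entropy(onehot_labels=one_hot_labels, logits = logits)
# Sum of all registered losses plus the regularization losses
total_loss = tf.losses.get_total_loss()
train_op = slim.learning.create_train_op(total_loss, optimizer=tf.train.AdamOptimizer(learning_rate=1e-4))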

I found this post very useful when I was learning Tensorflow.
