tf.datasets input_fn getting error after 1 epoch

Question

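I am trying to switch to an input_fn() built on tf.datasets. I have been able to get better steps/sec using tf.datasets with the input_fn() below, but I appear to run into an error after 1 epoch when running this experiment on GCMLE. Consider this input_fn():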

def input_fn(...):
    files = tf.data.Dataset.list_files(filenames).shuffle(num_shards)

    dataset = files.apply(tf.contrib.data.parallel_interleave(
        lambda filename: tf.data.TextLineDataset(filename).skip(1),
        cycle_length=num_shards))
    dataset = dataset.apply(tf.contrib.data.map_and_batch(
        lambda row: parse_csv_dataset(row, hparams=hparams),
        batch_size=batch_size,
        num_parallel_batches=multiprocessing.cpu_count()))
    dataset = dataset.prefetch(1)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat(num_epochs)

    iterator = dataset.make_initializable_iterator()
    features = iterator.get_next()
    tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, iterator.initializer)

    labels = {key: features.pop(key) for key in LABEL_COLUMNS}

    return features, labels

I get the following error on GCMLE:

disable=protected-access InvalidArgumentError (see above for traceback): Inputs to operation loss/sparse_softmax_cross_entropy_loss/num_present/Select of type Select must have the same size and shape. Input 0: [74] != input 1: [110] [[Node: loss/sparse_softmax_cross_entropy_loss/num_present/Select = Select[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](loss/sparse_softmax_cross_entropy_loss/num_present/Equal, loss/sparse_softmax_cross_entropy_loss/num_present/zeros_like, loss/sparse_softmax_cross_entropy_loss/num_present/ones_like)]] [[Node: global_step/add/_1509 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3099_global_step/add", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

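This implies a shape mismatch, Input 0: [74] != input 1: [110]. However, my old queue-based input_fn() works fine on the exact same data, so I do not believe the underlying data is the problem. The error occurs at what I believe is the end of the epoch (the num_steps at which the GCMLE job fails is right around num_train_examples/batch_size), so my guess is that the final batch is not equal to the batch_size of 110 (the value shown in the error) and instead contains only 74 examples. Can anybody confirm that this is the error? Assuming it is, is there some other flag I need to set so that the last batch can be something other than the specified batch size of 110?

For what it's worth, I have replicated this behavior with two different datasets: training runs for multiple epochs with the old queue-based input_fn, but gets hung up at the end of the first epoch with the tf.datasets input_fn.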

Answer 1

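Some operation in your graph (from the error message, probably sparse_softmax_cross_entropy_loss) apparently expects a fixed batch size. It may be your own code (outside the input_fn) that enforces this, for example by passing batch_size as the shape of some tensor used in an op, or it may be one of the TF libraries.

This is not always a problem in itself. However, the documented behavior of tf.data.Dataset.batch is: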

NOTE: If the number of elements (N) in this dataset is not an exact multiple of batch_size, the final batch contain smaller tensors with shape N % batch_size in the batch dimension. If your program depends on the batches having the same shape, consider using the tf.contrib.data.batch_and_drop_remainder transformation instead.

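As currently written, your (non-input_fn) code falls into the category of programs that depend on batches having the same shape.

Your options are to track down where the code passes a static batch size, or to "drop the remainder". I believe the former is preferable, but it is more work.

If you go with the latter, note that you are not actually using tf.data.Dataset.batch, but rather tf.contrib.data.map_and_batch, which accepts a drop_remainder parameter.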
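The arithmetic behind the documented note and the "drop the remainder" option can be sketched in plain Python. This is only a stdlib simulation of the batch sizes, not the tf.data API; the helper name batch_sizes and the dataset size of 294 rows are assumptions chosen so that the leftover matches the 74 seen in the error.

```python
def batch_sizes(num_elements, batch_size, drop_remainder=False):
    """Return the size of each batch produced from num_elements items."""
    full_batches, remainder = divmod(num_elements, batch_size)
    sizes = [batch_size] * full_batches
    if remainder and not drop_remainder:
        sizes.append(remainder)  # the smaller final batch (N % batch_size)
    return sizes


# Hypothetical dataset size: two full batches of 110 plus 74 leftover rows.
NUM_EXAMPLES = 294
BATCH_SIZE = 110

print(batch_sizes(NUM_EXAMPLES, BATCH_SIZE))                       # [110, 110, 74]
print(batch_sizes(NUM_EXAMPLES, BATCH_SIZE, drop_remainder=True))  # [110, 110]
```

With the default behavior the last batch has 74 rows, which is exactly the kind of batch that code assuming a fixed size of 110 cannot handle.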

Answer 2

As Robbie suggests, it looks like your old implementation used fixed batch sizes throughout (presumably using an API like tf.train.batch() or one of its wrappers with the default argument of allow_smaller_final_batch=False), whereas the default behavior of batching in tf.data (via tf.data.Dataset.batch() and tf.contrib.data.map_and_batch()) is to include the smaller final batch.

The bug is most likely in the model_fn. Without seeing that function it is difficult to guess, but I suspect that there is either an explicit (and incorrect) assertion of a tensor's shape via Tensor.set_shape() (possibly in library code) or a bug in the implementation of tf.losses.sparse_softmax_cross_entropy().

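First, I am assuming that the features and labels tensors returned from input_fn() have a statically unknown batch size. Can you confirm that by printing the features and labels objects and checking that their reported Tensor.shape properties have None for the 0th dimension?

Next, locate the call to tf.losses.sparse_softmax_cross_entropy() in your model_fn. Print the object that is passed as the weights argument to this function, which should be a tf.Tensor, and find its static shape. Given the error you are seeing, I suspect it will have a shape like (110,), where 110 is your specified batch size. If that is the case, there is a bug in model_fn that incorrectly asserts that the shape of the weights is a full batch, when it might not be. (If that is not the case, then there is a bug in tf.losses.sparse_softmax_cross_entropy()! Please open a GitHub issue with an example that enables us to reproduce the problem.)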

Aside: Why would this explain the bug? The code that calls the failing tf.where() op looks like this (edited for readability):
num_present = tf.where(tf.equal(weights, 0.0),  # This input is shape [74]
                       tf.zeros_like(weights),  # This input is shape [110]
                       tf.ones_like(weights)    # This input is probably [110]
)
This flavor of tf.where() op (named "Select" in the error message for historical reasons) requires that all three inputs have the same size. Superficially, tf.equal(weights, 0.0), tf.ones_like(weights), and tf.zeros_like(weights) all have the same shape, which is the shape of weights. However, if the static shape (the result of Tensor.shape) differs from the dynamic shape, then the behavior is undefined.

What actually happens? In this particular case, let's say the static shape of weights is [110], but the dynamic shape is [74]. The static shape of our three arguments to tf.where() will be [110]. The implementation of tf.equal() doesn't care that there's a mismatch, so its dynamic shape will be [74]. The implementations of tf.zeros_like() and tf.ones_like() use an optimization that ignores that dynamic shape when the static shape is fully defined, and so their dynamic shapes will be [110], causing the error you are seeing.
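The mechanism described above can be illustrated with a toy model in plain Python. This is not TensorFlow's implementation: ToyTensor, equal, fill_like, and select are hypothetical stand-ins that only mimic the behavior described in this answer (an elementwise op follows the real data, while the *_like ops trust a fully defined static shape).

```python
class ToyTensor:
    """Toy stand-in for a tensor: a declared (static) shape plus actual values."""

    def __init__(self, values, static_shape=None):
        self.values = list(values)
        # static_shape is what the graph was told; None means "unknown".
        self.static_shape = static_shape

    @property
    def dynamic_shape(self):
        return [len(self.values)]


def equal(t, scalar):
    # Elementwise op: works on the values that are actually present.
    return ToyTensor([v == scalar for v in t.values], t.static_shape)


def fill_like(t, fill):
    # Modeled shortcut: trust the static shape when it is fully defined.
    if t.static_shape is not None:
        n = t.static_shape[0]
    else:
        n = t.dynamic_shape[0]
    return ToyTensor([fill] * n, t.static_shape)


def select(cond, x, y):
    # Like the "Select" op, require all three inputs to have the same size.
    if not (cond.dynamic_shape == x.dynamic_shape == y.dynamic_shape):
        raise ValueError("Input 0: %s != input 1: %s"
                         % (cond.dynamic_shape, x.dynamic_shape))
    return ToyTensor([a if c else b
                      for c, a, b in zip(cond.values, x.values, y.values)])


# Final batch: 74 real rows, but the shape was wrongly asserted to be [110].
weights = ToyTensor([1.0] * 74, static_shape=[110])
try:
    select(equal(weights, 0.0), fill_like(weights, 0.0), fill_like(weights, 1.0))
except ValueError as err:
    print(err)  # Input 0: [74] != input 1: [110]
```

If the static shape is left unknown (static_shape=None in this toy model), all three inputs have 74 elements and the select succeeds, which mirrors why removing the incorrect shape assertion resolves the error.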

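The right fix is to locate the code in your model_fn that asserts a fixed batch size and remove it. The optimization and evaluation logic in TensorFlow is robust to variable batch sizes, and removing the assertion ensures that all of your data is used in the training and evaluation processes.

A less desirable short-term fix would be to drop the small batch at the end of the data. There are a couple of options here: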

Drop some data randomly at the end of each epoch:
- With TF 1.8 or later, pass drop_remainder=True to tf.contrib.data.map_and_batch().
- With TF 1.7 or earlier, use dataset = dataset.filter(lambda features: tf.equal(tf.shape(features[LABEL_COLUMNS[0]])[0], batch_size)) after the map_and_batch.
Drop the last batch of data:
- Move the dataset.repeat(NUM_EPOCHS) before map_and_batch(), and then apply one of the two fixes mentioned above.
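The difference between the two options comes down to where the remainder is dropped, which a short plain-Python calculation can make concrete. This is a sketch of the arithmetic only, not tf.data code; the function names and the figures (294 examples, 5 epochs) are assumptions for illustration.

```python
def dropped_batch_then_repeat(num_examples, batch_size, num_epochs):
    """Batch inside each epoch (dropping the remainder), then repeat.

    Every epoch loses its own partial batch.
    """
    return num_epochs * (num_examples % batch_size)


def dropped_repeat_then_batch(num_examples, batch_size, num_epochs):
    """Repeat first, then batch (dropping the remainder).

    Batches span epoch boundaries, so only the single partial batch at the
    very end of the repeated stream is lost.
    """
    return (num_examples * num_epochs) % batch_size


# Hypothetical figures: 294 examples, batch size 110, 5 epochs.
print(dropped_batch_then_repeat(294, 110, 5))  # 370 examples never seen
print(dropped_repeat_then_batch(294, 110, 5))  # 40 examples never seen
```

Repeating before batching discards far less data, at the cost of some batches mixing examples from two consecutive epochs.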

tensorflow

google-cloud-ml

tensorflow-datasets