TensorFlow Estimator.predict_scores not yielding the correct number of predictions when using the Dataset API in the input function

Question

I am using TensorFlow 1.5 and I am puzzled by some strange behavior that I cannot explain.
I have put together a minimal example:

import tensorflow as tf
import numpy as np


def input_function(x, y, batch_size=128, shuffle=True, n_epochs=None):
    data_set = tf.data.Dataset.from_tensor_slices({"x": x, "y": y})
    if shuffle:
        data_set = data_set.shuffle(buffer_size=1024, seed=None, reshuffle_each_iteration=True)
    data_set = data_set.batch(batch_size)
    data_set = data_set.repeat(n_epochs)
    iterator = data_set.make_one_shot_iterator()
    example = iterator.get_next()
    return {"features": example["x"]}, example["y"]


def main():
    n_samples = 256
    n_features = 16
    n_labels = 1

    x = np.random.rand(n_samples, n_features).astype(np.float32)
    y = np.random.rand(n_samples, n_labels).astype(np.float32)

    feature_column = tf.contrib.layers.real_valued_column(column_name='features', dimension=n_features)
    estimator = tf.contrib.learn.DNNRegressor([10], [feature_column], optimizer=tf.train.AdamOptimizer())

    estimator.fit(input_fn=lambda: input_function(x, y, batch_size=128, shuffle=True, n_epochs=32))
    pred = estimator.predict_scores(input_fn=lambda: input_function(x, y, batch_size=16, shuffle=False, n_epochs=1))
    print("len(pred) = {} (should be {})".format(len(list(pred)), n_samples))


if __name__ == '__main__':
    main()

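In this example, the call to 'fit' seems to work (although I am not sure), but the call to 'predict_scores' yields only batch_size (= 16) predictions instead of n_samples (= 256). What am I doing wrong?
The problem goes away if I use tf.estimator.inputs.numpy_input_fn, but eventually I will have to use an input function built on TFRecordDataset to read large amounts of training data from tfrecord files, similar to what is shown here: https://www.tensorflow.org/programmers_guide/datasets#using_high-level_apis
Any help would be greatly appreciated.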

Answer 1

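This is a bug in the tf.contrib.learn.Estimator class: it incorrectly assumes that the input is constant, so it reads only a single batch instead of running the input function repeatedly to fetch all of the data. The tf.contrib.learn.Estimator and tf.contrib.learn.DNNRegressor classes are deprecated and scheduled for removal, so it is unlikely that they will be fixed.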
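
To make the failure mode concrete, here is a minimal pure-Python sketch of the difference between reading one batch and draining the input. It is only a toy model of the behavior described above, not TensorFlow code and not the actual tf.contrib.learn implementation; the names make_batches, toy_model, predict_first_batch_only and predict_until_exhausted are all made up for illustration.

```python
def make_batches(n_samples, batch_size, n_epochs=1):
    """Yield batches of sample indices, loosely mimicking batch(...).repeat(...)."""
    for _ in range(n_epochs):
        for start in range(0, n_samples, batch_size):
            yield list(range(start, min(start + batch_size, n_samples)))


def toy_model(batch):
    """Hypothetical stand-in for a trained model: one placeholder score per sample."""
    return [0.0 for _ in batch]


def predict_first_batch_only(batch_iter):
    """Models the faulty behavior: the input is fetched once and treated as constant."""
    return toy_model(next(batch_iter))


def predict_until_exhausted(batch_iter):
    """Models the expected behavior: keep fetching batches until the input ends."""
    scores = []
    for batch in batch_iter:
        scores.extend(toy_model(batch))
    return scores


if __name__ == "__main__":
    print(len(predict_first_batch_only(make_batches(256, 16))))  # 16
    print(len(predict_until_exhausted(make_batches(256, 16))))   # 256
```

With n_samples = 256 and batch_size = 16, the first function returns 16 scores, matching the symptom in the question, while the second returns all 256.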

However, the tf.estimator.DNNRegressor class has been fixed to work with tf.data, and you can modify your code to use it as follows:

def main():
    n_samples = 256
    n_features = 16
    n_labels = 1

    x = np.random.rand(n_samples, n_features).astype(np.float32)
    y = np.random.rand(n_samples, n_labels).astype(np.float32)

    feature_column = tf.contrib.layers.real_valued_column(
        column_name='features', dimension=n_features)

    # Use the `tf.estimator.DNNRegressor` constructor instead of
    # `tf.contrib.learn.DNNRegressor`.
    estimator = tf.estimator.DNNRegressor(
        [10], [feature_column], optimizer=tf.train.AdamOptimizer())

    # Replace `estimator.fit()` with `estimator.train()`.
    estimator.train(input_fn=lambda: input_function(
        x, y, batch_size=128, shuffle=True, n_epochs=32))

    # Replace `estimator.predict_scores()` with `estimator.predict()`.
    pred = estimator.predict(input_fn=lambda: input_function(
        x, y, batch_size=16, shuffle=False, n_epochs=1))

    print("len(pred) = {} (should be {})".format(len(list(pred)), n_samples))
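
One thing to keep in mind when checking the result: estimator.predict() returns a generator that yields one item per example, so it can only be consumed once. The sketch below shows one way to drain it into a list before counting. It assumes each item is a dict and that the scores sit under the key "predictions", which is my understanding of the canned tf.estimator regressors, so verify it against your version. collect_predictions and fake_predict are hypothetical helpers, and fake_predict merely stands in for a real estimator so that the snippet runs without TensorFlow.

```python
def collect_predictions(pred_iter, key="predictions"):
    """Drain an iterable of per-example prediction dicts into a list of values.

    The default key "predictions" is an assumption about the dict layout.
    """
    return [example[key] for example in pred_iter]


def fake_predict(n_samples):
    """Hypothetical stand-in for estimator.predict(...): yields one dict per example."""
    for i in range(n_samples):
        yield {"predictions": [float(i)]}


if __name__ == "__main__":
    scores = collect_predictions(fake_predict(256))
    print(len(scores))  # 256
```

With a real estimator you would pass the result of estimator.predict(input_fn=...) in place of fake_predict(256) and keep the returned list for any later processing, since the generator itself is exhausted after the first pass.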

python

tensorflow

tensorflow-datasets

tensorflow-estimator