Best way to process terabytes of data on gcloud ml-engine with Keras

Question

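I want to train a model on about 2 TB of image data in Google Cloud Storage. I saved the image data as separate TFRecords and tried to use the TensorFlow Data API, following this example: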

https://medium.com/@moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36

However, Keras's model.fit(...) does not seem to support validation with TFRecord datasets, based on this pull request:

https://github.com/keras-team/keras/pull/8388

Is there a better approach for processing large amounts of data with Keras on ml-engine that I am missing?

Thanks a lot!

Answer 1

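If you are willing to use tf.keras instead of standalone Keras, you can instantiate a TFRecordDataset with the tf.data API and pass it directly to model.fit(). Bonus: you can stream directly from Google Cloud Storage, with no need to download the data first: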

import tensorflow as tf

# Construct a TFRecordDataset
ds_train = tf.data.TFRecordDataset('gs://')  # path to TFRecords on GCS
ds_train = ds_train.shuffle(1000).batch(32)

# model is an already compiled tf.keras model
model.fit(ds_train)
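A TFRecordDataset yields serialized records, so in practice each record usually has to be decoded before it reaches model.fit(). Here is a minimal sketch of that step, assuming each record is a tf.train.Example with a JPEG-encoded "image" feature and an integer "label" feature. Those feature names, the 224x224 target size, and the helper names parse_example and make_dataset are hypothetical; adjust them to however your records were written. It also assumes a TensorFlow version that provides tf.data.AUTOTUNE.

```python
import tensorflow as tf

# Hypothetical feature layout: change keys and types to match your records.
FEATURES = {
    "image": tf.io.FixedLenFeature([], tf.string),  # JPEG bytes (assumed)
    "label": tf.io.FixedLenFeature([], tf.int64),   # integer class id (assumed)
}


def parse_example(serialized, image_size=(224, 224)):
    """Decode one serialized tf.train.Example into an (image, label) pair."""
    parsed = tf.io.parse_single_example(serialized, FEATURES)
    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    # Resize to a fixed shape so batching works, then scale to [0, 1].
    image = tf.image.resize(image, image_size) / 255.0
    return image, parsed["label"]


def make_dataset(paths, batch_size=32, shuffle_buffer=1000):
    """Build a shuffled, batched, repeating dataset of (image, label) pairs."""
    ds = tf.data.TFRecordDataset(paths)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    # repeat() keeps the stream going across epochs when steps_per_epoch is set.
    return ds.shuffle(shuffle_buffer).batch(batch_size).repeat().prefetch(1)
```

The result of make_dataset can be passed to model.fit() in place of the raw dataset; because it repeats indefinitely, it needs steps_per_epoch to mark where an epoch ends.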

To include validation data, create another TFRecordDataset from your validation TFRecords and pass it to the validation_data argument of model.fit(). Note: this is possible as of TensorFlow 1.9.
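As a rough sketch of what that call can look like, assuming both datasets already yield (features, labels) batches and the model is compiled. The wrapper name fit_with_validation is hypothetical, and passing validation_steps explicitly is my assumption: it mirrors steps_per_epoch for the validation set and may not be required in every TensorFlow version.

```python
import tensorflow as tf


def fit_with_validation(model, ds_train, ds_val, steps_per_epoch,
                        validation_steps, epochs=1):
    """Train a compiled tf.keras model on ds_train and validate on ds_val.

    Both datasets are expected to yield (features, labels) batches.
    """
    return model.fit(
        ds_train,
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,    # training batches per epoch
        validation_data=ds_val,             # a tf.data.Dataset, not arrays
        validation_steps=validation_steps,  # validation batches per epoch
    )
```

The returned History object then carries val_loss (and any validation metrics) next to the training values.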

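A final note: you need to specify the steps_per_epoch argument. One trick I use to find the total number of examples across all TFRecord files is to simply iterate over the files and count: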

import tensorflow as tf

def n_records(record_list):
    """Get the total number of records in a collection of TFRecords.
    Since a TFRecord file is intended to act as a stream of data,
    this needs to be done naively by iterating over the file and counting.
    See 

    Args:
        record_list (list): list of GCS paths to TFRecords files
    """
    counter = 0
    for f in record_list:
        counter += sum(1 for _ in tf.python_io.tf_record_iterator(f))
    return counter
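On TensorFlow 2.x, tf.python_io is no longer in the top-level namespace, so the function above will not run there unchanged. A possible equivalent, assuming eager execution (the TF 2.x default), is to iterate a TFRecordDataset instead. The name n_records_tf2 is just for illustration:

```python
import tensorflow as tf


def n_records_tf2(record_list):
    """Count records across TFRecord files by iterating a TFRecordDataset.

    Args:
        record_list (list): list of paths to TFRecord files
    """
    dataset = tf.data.TFRecordDataset(record_list)
    # Every record is still read once; there is no stored count to look up.
    return sum(1 for _ in dataset)
```

This is still a full pass over the data, so for terabytes of records it may be worth counting once and saving the number somewhere.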

You can use n_records to compute steps_per_epoch:

n_train = n_records(['gs://path-to-tfrecords/record1',
                     'gs://path-to-tfrecords/record2'])

batch_size = 32  # must match the batch size used when building the dataset
steps_per_epoch = n_train // batch_size
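Note that floor division leaves out up to batch_size - 1 examples per epoch whenever the count is not a multiple of the batch size. If you would rather have each epoch cover the final partial batch too, a small helper like the following could be used. It is only a sketch; steps_per_epoch_for and its drop_remainder flag are names I made up here:

```python
import math


def steps_per_epoch_for(n_examples, batch_size, drop_remainder=False):
    """Return the number of batches needed to cover n_examples once."""
    if n_examples <= 0 or batch_size <= 0:
        raise ValueError("n_examples and batch_size must be positive")
    if drop_remainder:
        # Same as n_examples // batch_size: the partial batch is skipped.
        return n_examples // batch_size
    # Round up so the last, smaller batch is counted as a step.
    return math.ceil(n_examples / batch_size)
```

For example, 100 examples with a batch size of 32 give 4 steps when rounding up and 3 steps with drop_remainder=True. The same calculation applies to validation_steps.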


keras

tensorflow

google-cloud-ml

tfrecord

tensorflow-datasets