Using Datasets from large numpy arrays in Tensorflow

I am trying to load a dataset that is stored in two .npy files on my drive (one for the features, one for the ground truth) and use it to train a neural network.

print("loading features...")
data = np.load("[...]/features.npy")

print("loading labels...")
labels = np.load("[...]/groundtruth.npy") / 255

dataset = tf.data.Dataset.from_tensor_slices((data, labels))

The call to from_tensor_slices() throws the error tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

The ground-truth file is larger than 2.44 GB, so I run into problems when creating a dataset from it (see the warnings here and here).

The possible solutions I have found either target TensorFlow 1.x (here and , while I am running version 2.6) or use numpy's memmap (), which unfortunately I could not get to run, and I am also wondering whether it would slow down the computation?
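To make the memmap idea concrete, this is roughly what I mean: numpy's mmap_mode combined with tf.data.Dataset.from_generator. It is only a sketch, not something I have gotten to run; the paths are placeholders as above and the batch size is arbitrary.

import numpy as np
import tensorflow as tf

# Memory-map the arrays so only the slices actually read are pulled from disk.
data = np.load("[...]/features.npy", mmap_mode="r")
labels = np.load("[...]/groundtruth.npy", mmap_mode="r")

def gen():
    # Yield one (feature, label) pair at a time and normalize the label on the fly.
    for x, y in zip(data, labels):
        yield x.astype(np.float32), np.asarray(y, dtype=np.float32) / 255

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=data.shape[1:], dtype=tf.float32),
        tf.TensorSpec(shape=labels.shape[1:], dtype=tf.float32),
    ),
).batch(32)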

Any help would be greatly appreciated, thank you!

You will need some kind of data generator, because your data is too large to fit directly into tf.data.Dataset.from_tensor_slices. I do not have your dataset, but here is an example of how you can fetch batches of data and train your model in a custom training loop. The data is an NPZ NumPy archive from here:

import random

import numpy as np

def load_data(file='dsprites_ndarray_co1sh3sc6or40x32y32_64x64.npz'):
    # The archive stores the images under 'imgs' and the integer
    # factor labels under 'latents_classes'.
    dataset_zip = np.load(file, encoding='latin1')

    images = dataset_zip['imgs']
    latents_classes = dataset_zip['latents_classes']

    return images, latents_classes

def get_batch(indices, train_images, train_categories):
    # Select only the requested samples; the full arrays stay in host memory.
    shapes_as_categories = np.array([train_categories[i][1] for i in indices])
    images = np.array([train_images[i] for i in indices])

    images = images.reshape((images.shape[0], 64, 64, 1)).astype('float32')
    shapes_as_categories = shapes_as_categories.reshape((shapes_as_categories.shape[0], 1)).astype('float32')

    return [images, shapes_as_categories]

# Load your data once
train_images, train_categories = load_data()
indices = list(range(train_images.shape[0]))
random.shuffle(indices)

epochs = 2000
batch_size = 256
total_batch = train_images.shape[0] // batch_size

for epoch in range(epochs):
    for i in range(total_batch):
        batch_indices = indices[batch_size * i: batch_size * (i + 1)]
        batch = get_batch(batch_indices, train_images, train_categories)
        ...
        ...
        # Train your model with this batch.
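If you would rather stay within the tf.data API (for example, to call model.fit), the same batch logic can be wrapped in a generator. This is only a sketch that reuses the load_data/get_batch helpers and the already loaded train_images/train_categories from above, and it assumes the (64, 64, 1) image shape and single-category label produced by get_batch:

import random

import tensorflow as tf

def batch_generator(batch_size=256):
    # Reshuffle the sample order on every pass over the data.
    indices = list(range(train_images.shape[0]))
    random.shuffle(indices)
    for i in range(len(indices) // batch_size):
        batch_indices = indices[batch_size * i: batch_size * (i + 1)]
        images, categories = get_batch(batch_indices, train_images, train_categories)
        yield images, categories

dataset = tf.data.Dataset.from_generator(
    batch_generator,
    output_signature=(
        tf.TensorSpec(shape=(None, 64, 64, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
    ),
)

# model.fit(dataset, epochs=...) would then iterate over these batches.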