Train a neural network with input as sliding windows of a matrix with Tensorflow / Keras, and memory issues

I have a dataset which is a big matrix of shape (100000, 2000).

I would like to train my Tensorflow neural network with all the possible sliding windows/submatrices of shape (16, 2000) of this big matrix.

I use:

from skimage.util.shape import view_as_windows

A.shape  # (100000, 2000), i.e. a 100k x 2k matrix
X = view_as_windows(A, (16, 2000)).reshape((-1, 16, 2000, 1))
X.shape   # (99985, 16, 2000, 1)
...
model.fit(X, Y, batch_size=4, epochs=8)

Unfortunately, this leads to a memory problem:

W tensorflow/core/framework/allocator.cc:122] Allocation of ... exceeds 10% of system memory.

This is normal, since X has ~100k * 16 * 2k coefficients, i.e. more than 3 billion coefficients!
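A quick back-of-the-envelope check of that figure (plain arithmetic):

n_windows = 100000 - 16 + 1       # 99985 sliding windows over the rows
n_coeffs = n_windows * 16 * 2000  # 3199520000, i.e. ~3.2 billion coefficients
print(n_coeffs * 8 / 1e9, "GB")   # ~25.6 GB if materialized as float64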

But in fact, loading X in memory is a waste of memory, since it is highly redundant: it is made of sliding windows of shape (16, 2000) over A.

Question: how to train a neural network whose input is all the sliding windows of width 16 over a 100k x 2k matrix, without wasting memory?

The documentation of skimage.util.view_as_windows does state that it is costly in memory:

One should be very careful with rolling views when it comes to memory usage. Indeed, although a ‘view’ has the same memory footprint as its base array, the actual array that emerges when this ‘view’ is used in a computation is generally a (much) larger array than the original, especially for 2-dimensional arrays and above.

For example, let us consider a 3 dimensional array of size (100, 100, 100) of float64. [...] the hypothetical size of the rolling view (if one was to reshape the view for example) would be 8*(100-3+1)**3*3**3 which is about 203 MB! The scaling becomes even worse as the dimension of the input array becomes larger.
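Indeed, that 203 MB figure checks out:

print(8 * (100 - 3 + 1)**3 * 3**3)  # 203297472 bytes, i.e. ~203 MB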


Edit: timeseries_dataset_from_array is exactly what I'm looking for, except that it only works for 1D sequences:

import tensorflow
import tensorflow.keras.preprocessing
x = list(range(100))
x2 = tensorflow.keras.preprocessing.timeseries_dataset_from_array(x, None, 10, sequence_stride=1, sampling_rate=1, batch_size=128, shuffle=False, seed=None, start_index=None, end_index=None)
for b in x2:
    print(b)

and it doesn't work for 2D arrays:

import numpy as np

x = np.array(range(90)).reshape(6, 15)
print(x)
x2 = tensorflow.keras.preprocessing.timeseries_dataset_from_array(x, None, (6, 3), sequence_stride=1, sampling_rate=1, batch_size=128, shuffle=False, seed=None, start_index=None, end_index=None)
# does not work

You can use a generator to produce your examples on the fly, instead of storing them all in memory.

You can either write a custom generator, or use one provided by Keras such as timeseries_dataset_from_array (docs), which can also yield windows with the help of options like sequence_stride.

With a custom generator, you can do something like:

def generator_custom(df3):
    for idx, row in df3.iterrows():
        # some preprocessing that builds X, y from the row
        yield X, y

Then you can use tf.data to batch it into batches of 128/64/32, like:

train_dataset = tf.data.Dataset.from_generator(
    lambda: generator_custom(df_train))  # in TF 2.x, also pass output_signature (or output_types)
train_dataset = train_dataset.batch(128, drop_remainder=True)
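Applied to the matrix in the question, a minimal sketch of such a generator (assuming A is the (100000, 2000) float64 array and Y holds one label per window; the names are illustrative) could look like:

import numpy as np
import tensorflow as tf

def window_generator(A, Y):
    # Yield one (16, 2000, 1) window at a time; nothing is materialized up front.
    for i in range(A.shape[0] - 16 + 1):
        yield A[i : i + 16, :, np.newaxis], Y[i]

train_dataset = tf.data.Dataset.from_generator(
    lambda: window_generator(A, Y),
    output_signature=(
        tf.TensorSpec(shape=(16, 2000, 1), dtype=tf.float64),  # assumes A is float64
        tf.TensorSpec(shape=(), dtype=tf.float64),             # assumes scalar labels
    ),
).batch(128, drop_remainder=True)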

Replying to your comment about two dimensions:

As an example (I scaled 100000, 2000 down to 1000, 200; feel free to change that):

import numpy as np
import tensorflow

x = np.array(range(200000)).reshape(1000, 200)
x2 = tensorflow.keras.preprocessing.timeseries_dataset_from_array(x, None, 16, sequence_stride=1, sampling_rate=1, batch_size=128)

This gives you something like:

shapes (128, 16, 200)
shapes (128, 16, 200)

Which is what you want, windows of shape (16, 2000), right? (Remember we used 200 instead of 2000 just for demonstration purposes.)
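If your model expects a trailing channel axis, as in the (16, 2000, 1) windows of the question, you can map it onto the dataset; a small sketch under that assumption:

import tensorflow as tf

x2 = x2.map(lambda w: tf.expand_dims(w, axis=-1))  # (128, 16, 200) -> (128, 16, 200, 1)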

If you use Tensorflow, you can use a Tensorflow Dataset and map a preprocessing function over the data, like this:

import tensorflow as tf

A.shape # (100000, 2000)

A = tf.convert_to_tensor(A)  # convert once, so slicing works inside the traced map function

def get_window(starting_idx):
    """Extract a window of A of shape (16, 2000, 1) as a tf.Tensor"""
    window = A[starting_idx : starting_idx + 16]
    return tf.expand_dims(window, axis=-1)  # add the trailing channel axis

# Make a dataset of window start indices, one per window (99985 of them)
data_ds = tf.data.Dataset.range(A.shape[0] - 16 + 1)
data_ds = data_ds.map(get_window)

# Make dataset for labels (Y must hold one label per window)
label_ds = tf.data.Dataset.from_tensor_slices(Y)

# Zip them into one dataset
ds = tf.data.Dataset.zip((data_ds, label_ds))

# Pre-batch the dataset
ds = ds.batch(4)

# Sanity check for batch size
for batch, label in ds:
    print(batch.shape)    # (4, 16, 2000, 1)
    break

# Now call .fit() without batch size
model.fit(ds, epochs=8)

Defining a function to extract each window and mapping it over an existing dataset should solve your memory problem, as it allows the windows to be formed only when they are needed.

This is generally one of the best ways to handle data when working with Tensorflow, and it scales to large amounts of data.
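If you also want shuffling and pipelining, both stay cheap here, because only the integer start indices are shuffled, never the materialized windows. A sketch along those lines (the tf.gather label lookup and the AUTOTUNE settings are assumptions, not part of the original answer):

idx_ds = tf.data.Dataset.range(A.shape[0] - 16 + 1)
idx_ds = idx_ds.shuffle(buffer_size=99985)  # shuffles plain integers, so it costs almost nothing

def get_pair(i):
    # Look up the window and its matching label by the same index, keeping them aligned
    return get_window(i), tf.gather(Y, i)

ds = idx_ds.map(get_pair, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.batch(4).prefetch(tf.data.AUTOTUNE)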

For more details, see tf.data.Dataset and the guide at tensorflow.org/guide/data.