How to create inputs and labels dataset lists for an LSTM without concatenating strings from multiple files at once?

I have multiple large text files, each about 1 GB in size. I want to train an LSTM model with TensorFlow Keras to predict the next word of the dataset. I need to take one block of text at a time from the string formed by the contents of all the files, then use the first block_size - 1 words as the input and the last word of the block as the label. Every tutorial I have found loads the full text files, concatenates them, and builds two lists, one for inputs and one for labels. When I tried to do this for my dataset, my machine ran out of memory and the process was killed by the OS. I have 8 GB of RAM. What is the best way to create this dataset with TensorFlow?

Example:

I have the following text file:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam efficitur viverra lacus, at porttitor ex bibendum at. Aenean venenatis lacus ex. Mauris ultrices laoreet sapien, at pharetra dolor consectetur id. Proin eleifend, ex condimentum auctor tincidunt, felis erat pharetra tellus, et venenatis augue metus in leo. Donec euismod orci non cursus eleifend. Vivamus blandit gravida arcu, sed pulvinar arcu. Fusce lobortis mauris in lectus molestie, eget condimentum ipsum cursus. Proin ultrices lobortis mauris quis dignissim. Maecenas efficitur feugiat sem nec accumsan. Nam placerat sapien sit amet sem interdum tristique. Praesent eu nibh elementum, iaculis risus eget, cursus lectus.

What I want are lists like the following:

inputs = ["Lorem ipsum dolor", "ipsum dolor sit", "dolor sit amet,", ...]
labels = ["sit", "amet,", "consectetur", ...]
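For reference, the windowing I have in mind looks roughly like the sketch below in plain Python (window_generator and block_size are just illustrative names; it reads one file at a time rather than concatenating them all, but each file's word list still has to fit in memory, which is the part I want to avoid):

# Illustrative sketch of the desired windowing; window_generator and
# block_size are made-up names, and each file is still read whole.
def window_generator(paths, block_size=4):
  for path in paths:
    with open(path) as f:
      words = f.read().split()
    # Each block of block_size words yields one (input, label) pair:
    # the first block_size - 1 words, then the word that follows them.
    for i in range(len(words) - block_size + 1):
      block = words[i:i + block_size]
      yield ' '.join(block[:-1]), block[-1]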

You can try using tensorflow-text:

import tensorflow as tf
import tensorflow_text as tft

# Write a small sample file for the demo; with real data you would pass
# your own list of text files to TextLineDataset below.
with open('data.txt', 'w') as f:
  f.write('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam efficitur viverra lacus?\n')

# TextLineDataset streams the files line by line, so their full contents
# never have to fit in memory at once.
train_data = tf.data.TextLineDataset(['data.txt'])

# Build the vocabulary; output_mode='int' maps each token to its index.
# (pad_to_max_tokens does not apply to 'int' mode, so it is dropped here.)
vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int', max_tokens=50)
vectorize_layer.adapt(train_data)

def sliding_window(x):
  window_size = 5
  encoded = vectorize_layer(x)
  # Each window of window_size tokens is one input sequence...
  x = tft.sliding_window(encoded, width=window_size, axis=0)
  # ...and the token that follows each window is its label.
  y = tft.sliding_window(encoded, width=window_size + 1, axis=0)[:, -1]
  # Drop the trailing windows that have no following token to predict.
  return x[:tf.shape(y)[0], :], y

train_data = train_data.map(sliding_window)

# Flatten the per-line batches of windows into one example per element.
inputs = train_data.map(lambda x, y: x).flat_map(tf.data.Dataset.from_tensor_slices)
labels = train_data.map(lambda x, y: y).flat_map(tf.data.Dataset.from_tensor_slices)
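
Because TextLineDataset takes a list of file paths and streams them line by line, you can pass all of your large files directly instead of concatenating them. To train on the result, the two datasets can be zipped back together and batched; below is a minimal sketch, where the Embedding/LSTM sizes are placeholder choices rather than anything from the question:

# Minimal training sketch; the layer sizes are placeholders.
dataset = tf.data.Dataset.zip((inputs, labels)).batch(32)

model = tf.keras.Sequential([
  tf.keras.layers.Embedding(input_dim=50, output_dim=16),  # input_dim matches max_tokens
  tf.keras.layers.LSTM(32),
  tf.keras.layers.Dense(50, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(dataset, epochs=1)

Alternatively, you can keep each (input, label) pair together from the start with train_data.flat_map(lambda x, y: tf.data.Dataset.from_tensor_slices((x, y))) instead of building two separate datasets and zipping them.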