Write a generator function for an LSTM text-generation model
I have an LSTM model for text generation, but when I try to increase the amount of input data I run into RAM problems, so I found that I can use the fit_generator function to load the data step by step.
The current problem is that keras.utils.to_categorical takes up a lot of space as the number of unique words grows.
So I would like to convert this code block into a generator function:
x_values, labels = input_seqs[:, :-1], input_seqs[:, -1]
y_values = tf.keras.utils.to_categorical(labels, num_classes=total_unique_words)
#Shape of x_values: (152250, 261)
#Shape of y_values: (152250, 4399)
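For scale, a quick back-of-the-envelope check of those shapes (assuming to_categorical's default float32 dtype, 4 bytes per entry) shows why RAM becomes a problem:

```python
# Memory footprint of the dense one-hot label matrix y_values,
# assuming to_categorical's default float32 dtype (4 bytes per entry).
n_samples = 152250   # rows of input_seqs
n_classes = 4399     # total_unique_words
bytes_needed = n_samples * n_classes * 4
gib = bytes_needed / 2**30
print(f"y_values alone needs about {gib:.1f} GiB")  # about 2.5 GiB
```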
I came up with something like this, but I am not sure how to assign the correct values to batch_x and batch_y:
def generator(input_seq, batch_size):
    index = 0
    while True:
        batch_x = np.zeros((batch_size, max_seq_length - 1))
        batch_y = np.zeros((batch_size, total_unique_words))
        for i in range(batch_size):
            batch_x[i] = input_seqs[:, :-1][i]
            batch_y[i] = tf.keras.utils.to_categorical(input_seqs[:, -1][i], num_classes=total_unique_words)
            index = index + 1
            if index == len(input_seq):
                index = 0
        yield batch_x, batch_y
The full code, for a better overview:
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(review_list)
word_index = tokenizer.word_index
total_unique_words = len(tokenizer.word_index) + 1

input_sequences = []
for line in review_list:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_seqs = token_list[:i+1]
        input_sequences.append(n_gram_seqs)

max_seq_length = max([len(x) for x in input_sequences])
input_seqs = np.array(pad_sequences(input_sequences, maxlen=max_seq_length, padding='pre'))
x_values, labels = input_seqs[:, :-1], input_seqs[:, -1]
y_values = tf.keras.utils.to_categorical(labels, num_classes=total_unique_words)

callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
K.clear_session()
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=total_unique_words, output_dim=100, input_length=max_seq_length-1),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(total_unique_words, activation='softmax')])
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
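As a side note on the data preparation: the n-gram loop above turns each review into every prefix of length two or more, so each training sample predicts its final token. A dependency-free sketch with a hypothetical three-token review shows the idea:

```python
# Tiny worked example of the n-gram windowing above, without Keras:
# suppose tokenizing one short review gives token_list = [1, 2, 3].
token_list = [1, 2, 3]
input_sequences = []
for i in range(1, len(token_list)):
    input_sequences.append(token_list[:i + 1])
print(input_sequences)  # [[1, 2], [1, 2, 3]]
# After pre-padding to max_seq_length=3 this becomes [[0, 1, 2], [1, 2, 3]];
# x_values are all but the last column, and labels are the last column.
```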
You could try something like this:
def generator(input_seq, batch_size, dataset_size):
    no_batches = dataset_size // batch_size
    while True:  # Keras expects the generator to loop indefinitely across epochs
        for i in range(no_batches):
            batch_x = input_seq[:, :-1][i*batch_size : (i+1)*batch_size]
            batch_y = tf.keras.utils.to_categorical(
                input_seq[:, -1][i*batch_size : (i+1)*batch_size],
                num_classes=total_unique_words)
            yield batch_x, batch_y
I added a dataset_size parameter (152250 in your case) so that the number of batches can be computed.
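Before wiring the generator into fit_generator (or model.fit with steps_per_epoch = dataset_size // batch_size), it can be sanity-checked by hand on toy data. The sketch below uses np.eye as a dependency-free stand-in for tf.keras.utils.to_categorical; the sizes here (10 sequences of length 5, vocabulary of 7) are made up for illustration:

```python
import numpy as np

total_unique_words = 7
max_seq_length = 5

# Toy stand-in for input_seqs: 10 pre-padded sequences of length 5.
rng = np.random.default_rng(0)
input_seqs = rng.integers(1, total_unique_words, size=(10, max_seq_length))

def generator(input_seq, batch_size, dataset_size):
    no_batches = dataset_size // batch_size
    while True:  # loop forever so the model can train for multiple epochs
        for i in range(no_batches):
            batch_x = input_seq[i * batch_size:(i + 1) * batch_size, :-1]
            labels = input_seq[i * batch_size:(i + 1) * batch_size, -1]
            # np.eye(...)[labels] one-hot encodes like to_categorical
            batch_y = np.eye(total_unique_words)[labels]
            yield batch_x, batch_y

gen = generator(input_seqs, batch_size=2, dataset_size=10)
batch_x, batch_y = next(gen)
print(batch_x.shape, batch_y.shape)  # (2, 4) (2, 7)
```

Each yielded batch_y row contains a single 1, matching one step of the full to_categorical matrix without ever materializing all of it.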