Write a generator function for an LSTM text-generation model
I have an LSTM model for text generation, but when I try to increase the amount of input data I run into RAM problems, so I found that I can use the fit_generator function to load the data step by step.
The current problem is that keras.utils.to_categorical takes up a lot of space as the number of unique words grows.
So I would like to convert this code block into a generator function:
x_values, labels = input_seqs[:, :-1], input_seqs[:, -1]
y_values = tf.keras.utils.to_categorical(labels, num_classes=total_unique_words)
#Shape of x_values: (152250, 261)
#Shape of y_values: (152250, 4399)
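For scale, a quick back-of-the-envelope check of those shapes (assuming to_categorical's default float32 dtype, 4 bytes per entry) shows why RAM becomes a problem:

```python
# Memory footprint of the dense one-hot label matrix y_values,
# assuming to_categorical's default float32 dtype (4 bytes per entry).
n_samples = 152250   # rows of input_seqs
n_classes = 4399     # total_unique_words
bytes_needed = n_samples * n_classes * 4
gib = bytes_needed / 2**30
print(f"y_values alone needs about {gib:.1f} GiB")  # about 2.5 GiB
```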
I came up with something like this, but I am not sure how to assign the correct values to batch_x and batch_y:
def generator(input_seq, batch_size):
    index = 0
    while True:
        batch_x = np.zeros((batch_size, max_seq_length - 1))
        batch_y = np.zeros((batch_size, total_unique_words))
        for i in range(batch_size):
            batch_x[i] = input_seqs[:, :-1][i]
            batch_y[i] = tf.keras.utils.to_categorical(input_seqs[:, -1][i], num_classes=total_unique_words)
            index = index + 1
            if index == len(input_seq):
                index = 0
        yield batch_x, batch_y
The full code, for a better overview:
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(review_list)
word_index = tokenizer.word_index
total_unique_words = len(tokenizer.word_index) + 1

input_sequences = []
for line in review_list:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_seqs = token_list[:i+1]
        input_sequences.append(n_gram_seqs)

max_seq_length = max([len(x) for x in input_sequences])
input_seqs = np.array(pad_sequences(input_sequences, maxlen=max_seq_length, padding='pre'))
x_values, labels = input_seqs[:, :-1], input_seqs[:, -1]
y_values = tf.keras.utils.to_categorical(labels, num_classes=total_unique_words)

callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
K.clear_session()
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=total_unique_words, output_dim=100, input_length=max_seq_length-1),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(total_unique_words, activation='softmax')])
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
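As a side note on the data preparation: the n-gram loop above turns each review into every prefix of length two or more, so each training sample predicts its final token. A dependency-free sketch with a hypothetical three-token review shows the idea:

```python
# Tiny worked example of the n-gram windowing above, without Keras:
# suppose tokenizing one short review gives token_list = [1, 2, 3].
token_list = [1, 2, 3]
input_sequences = []
for i in range(1, len(token_list)):
    input_sequences.append(token_list[:i + 1])
print(input_sequences)  # [[1, 2], [1, 2, 3]]
# After pre-padding to max_seq_length=3 this becomes [[0, 1, 2], [1, 2, 3]];
# x_values are all but the last column, and labels are the last column.
```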
You could try something like this:
def generator(input_seq, batch_size, dataset_size):
    no_batches = dataset_size // batch_size
    while True:  # Keras expects the generator to loop indefinitely across epochs
        for i in range(no_batches):
            batch_x = input_seq[:, :-1][i*batch_size : (i+1)*batch_size]
            batch_y = tf.keras.utils.to_categorical(
                input_seq[:, -1][i*batch_size : (i+1)*batch_size],
                num_classes=total_unique_words)
            yield batch_x, batch_y
I added a dataset_size parameter (152250 in your case) so that the number of batches can be computed.
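Before wiring the generator into fit_generator (or model.fit with steps_per_epoch = dataset_size // batch_size), it can be sanity-checked by hand on toy data. The sketch below uses np.eye as a dependency-free stand-in for tf.keras.utils.to_categorical; the sizes here (10 sequences of length 5, vocabulary of 7) are made up for illustration:

```python
import numpy as np

total_unique_words = 7
max_seq_length = 5

# Toy stand-in for input_seqs: 10 pre-padded sequences of length 5.
rng = np.random.default_rng(0)
input_seqs = rng.integers(1, total_unique_words, size=(10, max_seq_length))

def generator(input_seq, batch_size, dataset_size):
    no_batches = dataset_size // batch_size
    while True:  # loop forever so the model can train for multiple epochs
        for i in range(no_batches):
            batch_x = input_seq[i * batch_size:(i + 1) * batch_size, :-1]
            labels = input_seq[i * batch_size:(i + 1) * batch_size, -1]
            # np.eye(...)[labels] one-hot encodes like to_categorical
            batch_y = np.eye(total_unique_words)[labels]
            yield batch_x, batch_y

gen = generator(input_seqs, batch_size=2, dataset_size=10)
batch_x, batch_y = next(gen)
print(batch_x.shape, batch_y.shape)  # (2, 4) (2, 7)
```

Each yielded batch_y row contains a single 1, matching one step of the full to_categorical matrix without ever materializing all of it.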