tensorflow: NLP automatic text generator always prints the same word

I apologize in advance, but I have only just started exploring the world of NLP text generators. After training a neural network on some text, I tried to generate new text from that model given an initial sentence. No matter which starting sentence (seed text) I begin with, every automatically generated word that follows is "and". I don't understand why this happens or how to fix it. Again, I'm very new to this, so I'd appreciate any help.

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np 
import matplotlib.pyplot as plt

data=[' Hello, Chicago.If there is anyone out there who still doubts that America is a place where all things are possible,' ,
'who still wonders if the dream of our founders is alive in our time, who still questions the power of our democracy, tonight is your answer.',
'It’s the answer told by lines that stretched around schools and churches in numbers this nation has never seen,',
'by people who waited three hours and four hours, many for the first time in their lives, ',
'because they believed that this time must be different, that their voices could be that difference.',
'It’s the answer spoken by young and old, rich and poor, Democrat and Republican, black, white,',
'Hispanic, Asian, Native American, gay, straight, disabled and not disabled.',
'Americans who sent a message to the world that we have never been just a collection of individuals',
'or a collection of red states and blue states.',
'We are, and always will be, the United States of America.']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
total_words = len(tokenizer.word_index) + 1

input_sequences = []
for line in data:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# pad sequences 
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

xs, labels = input_sequences[:, :-1], input_sequences[:, -1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
nr_epochs=50
model = Sequential()
model.add(Embedding(total_words, nr_epochs, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
adam = Adam(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
history = model.fit(xs, ys, epochs=nr_epochs, verbose=1)
seed_text = "If there is anyone out there who still doubts"
next_words = 100
  
for _ in range(next_words):
    token_list1 = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list1], maxlen=max_sequence_len-1, padding='pre')
    predicted = model.predict(token_list, verbose=0)
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted.all():
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text) # it prints the starting sentence plus 'and' 100 times

I ran your code and checked what happens at each step. predicted.all() returns True, which compares equal to 1: the softmax output has no zero entries, so .all() is always true. And since tokenizer.word_index contains the entry 'and': 1 ('and' is the most frequent word in your corpus, so the Tokenizer assigns it index 1), your loop always selects the word 'and'.
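
To see why, here is a toy array standing in for the softmax row that model.predict returns; every entry is positive, so .all() evaluates to True:

import numpy as np

probs = np.array([[0.1, 0.7, 0.2]])   # stand-in for one softmax output row
print(probs.all())                    # True: no element is zero
print(int(probs.all()))               # 1, so index == predicted.all() only ever matches index 1
print(np.argmax(probs))               # 1 here by coincidence; in general, the index of the highest score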

You can try changing predicted.all() to np.argmax(predicted), which returns the index of the largest value in the array, i.e. the index of the word with the highest predicted score: the most likely next word.
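
As a minimal sketch, your generation loop with that one change (and tokenizer.index_word, the reverse lookup the Tokenizer builds, in place of the manual scan over word_index) would look like this:

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted = model.predict(token_list, verbose=0)          # shape (1, total_words)
    predicted_index = int(np.argmax(predicted, axis=-1)[0])   # index of the highest-scoring word
    seed_text += " " + tokenizer.index_word.get(predicted_index, "")
print(seed_text)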

There is still repetition across the 100 generated words, but that comes down to the model's performance; generating only 10 words looks fine.
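
If the repetition bothers you, one common workaround (beyond the fix above, just a sketch) is to sample the next word from the predicted distribution with a temperature instead of always taking the argmax:

def sample_index(probs, temperature=1.0):
    # Lower temperature -> closer to argmax; higher -> more random.
    probs = np.log(probs + 1e-8) / temperature
    probs = np.exp(probs)
    probs = probs / probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Usage inside the loop, replacing np.argmax:
# predicted_index = sample_index(model.predict(token_list, verbose=0)[0], temperature=0.8)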