Test data giving prediction error in Keras in the model with Embedding layer

Question

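I have trained a Bi-LSTM model for NER on a set of sentences. To do this, I took the distinct words in the data, built a mapping between each word and a number, and then created the Bi-LSTM model using those numbers. I then created and pickled that model object.

Now I have a new set of sentences containing some words that the trained model has never seen. These words therefore have no numeric value yet, so when I test them on my existing model it raises an error: the words or features cannot be found because their numeric values do not exist.

To get around this error, I assigned a new integer value to every new word I saw.

However, when I load the model and test it, I get this error: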

InvalidArgumentError: indices[0,24] = 5444 is not in [0, 5442)   [[Node: embedding_14_16/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true,
_device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_14_16/embeddings/read, embedding_14_16/Cast)]]

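The training data contains 5445 words including the padding word, so the indices run over [0, 5444].

5444 is the index value I assigned to the padding in the test sentences. It is not clear to me why the model assumes the index values lie in [0, 5442).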
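To make the message concrete: an embedding layer is a lookup table with `input_dim` rows, so the valid indices form the half-open range [0, input_dim). The NumPy sketch below is only an illustration of that lookup, not the Keras implementation. The table size 5442 is taken from the bound in the error message, and `lookup` is a hypothetical helper name.

```python
import numpy as np

# Stand-in for the embedding matrix: 5442 rows (the bound reported in the
# error message) and 50 columns (output_dim in the model code).
table = np.zeros((5442, 50), dtype=np.float32)

def lookup(indices):
    """Gather one row of the table per index, as an embedding lookup does."""
    return table[np.asarray(indices)]

ok = lookup([0, 5441])       # 5441 is the largest index the table can serve
print(ok.shape)              # (2, 50)

try:
    lookup([5444])           # the index named in the error message
    out_of_range = False
except IndexError:
    out_of_range = True
print(out_of_range)          # True
```

Any index at or above the number of rows fails in the same way, whether it stands for padding or for a newly numbered word.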

I used the base code available at the following link: https://www.kaggle.com/gagandeep16/ner-using-bidirectional-lstm

Code:

import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, TimeDistributed, Dense

inputs = Input(shape=(max_len,))
model = Embedding(input_dim=n_words, output_dim=50,
                  input_length=max_len)(inputs)

model = Dropout(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model)  # softmax output layer

model = Model(inputs, out)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

# number of epochs, also used for output file naming
epoch_num = 20
domain = "../data/Laptop_Prediction_Corrected"
output_file_name = domain + "_E" + str(epoch_num) + ".xlsx"

model_name = "../models/Laptop_Prediction_Corrected"
output_model_filename = model_name + "_E" + str(epoch_num) + ".sav"

history = model.fit(X_tr, np.array(y_tr), batch_size=32, epochs=epoch_num, validation_split=0.1, verbose=1)

max_len is the total number of words in a sentence and n_words is the vocabulary size. In the model, padding was done with the following code, where n_words=5441:

X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=n_words)

Padding in the new dataset:

max_len = 50
# this is to pad sentences to the maximum length possible
#-> so all records of X will be of the same length

#X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=res_new_word2idx["pad_blank"])

#X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=5441)

I am not sure which of these two paddings is correct.

However, the vocab only includes the words in the training data. When I call:

p = loaded_model.predict(X)

How can I use predict with text sentences that contain words not present in the initial vocab?

Answer 1

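You can use the Keras Tokenizer class and its methods to easily tokenize and preprocess the input data. Specify the vocabulary size when instantiating it (the num_words argument), then call its fit_on_texts() method on the training data to build a vocabulary from the given texts. After that, use its texts_to_sequences() method to convert each text string to a list of word indices. The good thing is that only words in the vocabulary are considered and all other words are ignored (or, if you pass an oov_token argument to the Tokenizer class, they are all mapped to that token, whose index is 1):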

from keras.preprocessing.text import Tokenizer

# set num_words to limit the vocabulary to the most frequent words
tok = Tokenizer(num_words=n_words)

# you can also pass an arbitrary token as `oov_token` argument 
# which will represent out-of-vocabulary words and its index would be 1
# tok = Tokenizer(num_words=n_words, oov_token='[unk]')

tok.fit_on_texts(X_train)

X_train = tok.texts_to_sequences(X_train)
X_test = tok.texts_to_sequences(X_test)  # use the same vocab to convert test data to sequences
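The idea behind the out-of-vocabulary token can be sketched without Keras. The following is a plain-Python stand-in, not the Tokenizer itself: `build_vocab`, `encode`, `PAD_ID` and `OOV_ID` are hypothetical names, and the two training sentences are made up for illustration. It assumes index 0 is kept for padding and index 1 for unseen words.

```python
from collections import Counter

PAD_ID = 0   # reserved for padding
OOV_ID = 1   # reserved for words that were not seen during training

def build_vocab(sentences):
    """Map each training word to an integer id, most frequent first, starting at 2."""
    counts = Counter(word for sent in sentences for word in sent.lower().split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common())}

def encode(sentence, vocab):
    """Convert a sentence to ids; words missing from the vocab fall back to OOV_ID."""
    return [vocab.get(word, OOV_ID) for word in sentence.lower().split()]

train = ["the laptop battery is good", "the screen is bright"]
vocab = build_vocab(train)   # built from training data only

seen = encode("the battery is bright", vocab)     # every word is known
unseen = encode("the keyboard is good", vocab)    # "keyboard" maps to OOV_ID
print(seen)
print(unseen)
```

Because every unseen word collapses onto one id that already existed at training time, new sentences never introduce an index the embedding table has no row for.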

Optionally, you can use the pad_sequences function to pad them with zeros or truncate them so that they all have the same length:

from keras.preprocessing.sequence import pad_sequences

X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)
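As a rough picture of what this step does to a single sequence, here is a plain-Python sketch. `pad_to_length` is a hypothetical helper, not part of Keras. It assumes the behaviour of the call above with no extra arguments, which as far as I know is to pad at the front with 0 and to drop items from the front when a sequence is too long.

```python
def pad_to_length(seq, max_len, value=0):
    """Left-pad `seq` with `value`, or keep only its last `max_len` items.

    Assumes max_len is a positive integer.
    """
    seq = list(seq)[-max_len:]                        # truncate from the front
    return [value] * (max_len - len(seq)) + seq       # pad at the front

short = pad_to_length([2, 5, 3], 5)
long_ = pad_to_length([1, 2, 3, 4, 5, 6, 7], 5)
print(short)   # [0, 0, 2, 5, 3]
print(long_)   # [3, 4, 5, 6, 7]
```

The point relevant to the question is that the padding value (0 here) is an index the embedding layer already covers, and the same value is used for training and test data.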

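Now, the vocab size would be n_words+1 if you did not use an OOV token and n_words+2 if you did. You can then pass the correct number to the Embedding layer as its input_dim argument (the first positional argument):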

Embedding(correct_num_words, embd_size, ...)
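Whichever count applies, input_dim has to be strictly greater than the largest index that can appear in the input. A small NumPy sketch for checking this before calling predict follows; `required_input_dim` and `indices_fit` are hypothetical helper names, and the id arrays are made-up examples.

```python
import numpy as np

def required_input_dim(sequences):
    """Smallest input_dim that covers every index in `sequences` (largest index + 1)."""
    return int(max(max(seq) for seq in sequences if len(seq) > 0)) + 1

def indices_fit(X, input_dim):
    """True if every index in X lies in the half-open range [0, input_dim)."""
    X = np.asarray(X)
    return bool(X.min() >= 0 and X.max() < input_dim)

X_train_ids = [[0, 0, 2, 5, 3], [0, 2, 1, 3, 6]]
X_bad_test = [[0, 0, 2, 1, 8]]     # 8 was never seen in training
X_good_test = [[0, 0, 2, 1, 6]]    # unseen word mapped to the OOV id 1 instead

input_dim = required_input_dim(X_train_ids)
print(input_dim)                             # 7
print(indices_fit(X_bad_test, input_dim))    # False
print(indices_fit(X_good_test, input_dim))   # True
```

Running such a check on the test array gives a readable failure instead of the InvalidArgumentError from inside the embedding lookup.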
