Neural network predicts very poorly though it has high accuracy

I'm working on an RNN. After training, I get very high accuracy on the test dataset, but when I predict on some external data, the predictions are very poor. I also tried the same dataset (over 300,000 texts across 57 classes) with a plain ANN, and its predictions were still poor. When I tried the same dataset with classic machine-learning models, they worked fine.

Here is my code:

import pandas as pd  # needed for pd.read_excel below
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, LSTM, BatchNormalization
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split

df = pd.read_excel("data.xlsx", usecols=["X", "y"])

df = df.sample(frac = 1)

X = np.array(df["X"])
y = np.array(df["y"])

le = LabelEncoder()
y = le.fit_transform(y)
y = y.reshape(-1,1)
encoder = OneHotEncoder(sparse=False)
y = encoder.fit_transform(y)

num_words = 100000
token = Tokenizer(num_words=num_words)
token.fit_on_texts(X)
seq = token.texts_to_sequences(X)
X = sequence.pad_sequences(seq, padding = "pre", truncating = "pre")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Sequential()
model.add(Embedding(num_words, 96, input_length = X.shape[1]))
model.add(LSTM(108, activation='relu', dropout=0.1, recurrent_dropout = 0.2))
model.add(BatchNormalization())
model.add(Dense(y.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer="rmsprop", metrics=['accuracy'])
model.summary()

history = model.fit(X_train, y_train, epochs=4, batch_size=64, validation_data = (X_test, y_test))

loss, accuracy = model.evaluate(X_test, y_test)

Here is the model's training-history plot:

After some research, I found that the model was actually working fine. The problem was incorrect use of the Keras Tokenizer.

At the end of the code, I had the following:

sentence = ["Example Sentence to Make Prediction."]
token.fit_on_texts(sentence) # <- This line is redundant.
seq = token.texts_to_sequences(sentence)
cx = sequence.pad_sequences(seq, maxlen = X.shape[1])
sx = np.argmax(model.predict(cx), axis=1)

The problem arose because I fit the Tokenizer again on the new data. Removing that line solved my problem: at inference time, the Tokenizer must only transform text with the vocabulary it learned during training, never be re-fitted.
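To see why re-fitting is harmful, here is a minimal sketch (with made-up toy sentences, not my actual data). `fit_on_texts` accumulates word counts across calls and rebuilds `word_index` sorted by frequency, so fitting on new text can reassign the integer ids that the trained Embedding layer depends on:

```python
from keras.preprocessing.text import Tokenizer

# Fit once on "training" text, as done before model training.
token = Tokenizer(num_words=10)
token.fit_on_texts(["the cat sat", "the dog ran"])
before = token.texts_to_sequences(["the cat"])  # ids from the training vocabulary

# Re-fitting on unseen text updates the cumulative word counts,
# which rebuilds word_index by frequency ...
token.fit_on_texts(["dog dog dog dog"])
after = token.texts_to_sequences(["the cat"])

# ... so the very same sentence now maps to different integer ids,
# and the model's embedding rows no longer correspond to the words
# it was trained on.
print(before, after)
```

Running this shows `before` and `after` differ even though the input sentence is identical, which is exactly why the model's predictions on new data looked broken.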