Keras TF: ValueError: Input arrays should have the same number of samples as target arrays

Using the Keras DL library with the TensorFlow backend, I am trying to implement batch and validation generators for sentiment analysis on the built-in IMDB dataset.

The dataset contains 25000 training samples and 25000 test samples. Since setting a cutoff for the number of words per sample yields fairly low accuracy, I am trying to batch the training and test samples so the memory load doesn't get too bad.

Current code:

from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, Dropout
from keras.layers import LSTM, TimeDistributed
from keras.datasets import imdb
from keras.callbacks import EarlyStopping, ModelCheckpoint
import numpy as np


max_features = 20000

def generate_batch(batchsize):
    '''

    '''
    (x_train, y_train), (_,_) = imdb.load_data()
    for i in range(0, len(x_train), batchsize):
        x_batch = x_train[i:(i+batchsize)]
        y_batch = y_train[i:(i+batchsize)]
        x_batch = sequence.pad_sequences(x_train, maxlen=None)
        yield(x_batch, y_batch)

def generate_val(valsize):
    '''
    '''
    (_,_), (x_test, y_test) = imdb.load_data()
    for i in range(0, len(x_test), valsize):
        x_val = x_test[i:(i+valsize)]
        y_val = y_test[i:(i+valsize)]
        x_val = sequence.pad_sequences(x_test, maxlen=None)
        yield(x_val, y_val)

print('Build model...')
primary_model = Sequential()
primary_model.add(Embedding(input_dim = max_features,
                    output_dim = max_features,
                    trainable=False, 
                    weights=[(np.eye(max_features,max_features))], 
                    mask_zero=True))
primary_model.add(TimeDistributed(Dense(150, use_bias=False)))
primary_model.add(LSTM(128))
primary_model.add(Dense(2, activation='softmax'))
primary_model.summary()
primary_model.compile(loss='sparse_categorical_crossentropy', 
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
filepath = "primeweights-{epoch:02d}-{val_acc:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath,
                            verbose=1,
                            save_best_only=True)
early_stopping_monitor = EarlyStopping(patience=2)

primary_model.fit_generator(generate_batch(25),
                            steps_per_epoch = 1000,
                            epochs = 1, 
                            callbacks=[early_stopping_monitor],
                            validation_data=generate_val(25),
                            validation_steps=1000)


score, acc = primary_model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

primary_model.save('primary_model_imdb.h5')

However, when trying to run the current code, Keras throws the following error:

Traceback (most recent call last):
  File "imdb_gen.py", line 94, in <module>
    validation_steps = 1000)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/models.py", line 1276, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/engine/training.py", line 2224, in fit_generator
    class_weight=class_weight)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/engine/training.py", line 1877, in train_on_batch
    class_weight=class_weight)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/engine/training.py", line 1490, in _standardize_user_data
    _check_array_lengths(x, y, sample_weights)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/engine/training.py", line 220, in _check_array_lengths
    'and ' + str(list(set_y)[0]) + ' target samples.')
ValueError: Input arrays should have the same number of samples as target arrays. Found 25000 input samples and 25 target samples.

There are several errors in the code.
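The mismatch in the traceback is easy to reproduce by pulling a single batch from the generator; a minimal probe, assuming the question's code as written:

gen = generate_batch(25)
x_batch, y_batch = next(gen)
print(len(x_batch), len(y_batch))  # prints 25000 25, the exact counts Keras reports

The padded x_batch contains all 25000 training sequences because pad_sequences is applied to the full x_train instead of the slice, while y_batch holds only 25 labels. The list below walks through this and the other problems: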

  • As @Y. Luo pointed out in the comments, pad_sequences must be applied to the batch slice, not the full training set:
x_batch = sequence.pad_sequences(x_train, maxlen=None) # gives 25000 samples

x_batch = sequence.pad_sequences(x_batch, maxlen=None) # gives batchsize samples
  • When loading the imdb dataset you must pass num_words=max_features; otherwise your embedding layer expects word ids in the range [0, max_features) but will end up receiving ids larger than that.
(x_train, y_train), (_,_) = imdb.load_data(num_words=max_features)
  • It is advisable to pass maxlen when padding; otherwise each batch is padded to its own maximum length, which can differ from batch to batch:
x_batch = sequence.pad_sequences(x_batch, maxlen=maxlen, padding='post')
  • You use the embedding layer without training it and keep its input and output dimensions the same. That made no sense to me, so I changed it:
primary_model.add(Embedding(input_dim = max_features,
                    output_dim = embedding_dim,
                    trainable=True, 
                    weights=[(np.eye(max_features,embedding_dim))], 
                    mask_zero=True))
  • To evaluate your model on the test data, you must first load the data and convert it to padded sequences:
(_,_), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen, padding='post')
score, acc = primary_model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
  • When training for multiple epochs with a generator, we must make sure the generator keeps yielding values. To do that, once the dataset is exhausted we need to start yielding from 0 again (the validation generator needs the same treatment; see the sketch after this list):
def generate_batch(batchsize):

    (x_train, y_train), (_,_) = imdb.load_data(num_words=max_features)
    print("train_size", x_train.shape)
    while True:
        for i in range(0, len(x_train), batchsize):
            x_batch = x_train[i:(i+batchsize)]
            y_batch = y_train[i:(i+batchsize)]
            x_batch = sequence.pad_sequences(x_batch, maxlen=maxlen, padding='post')
            yield(x_batch, y_batch)
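For completeness, here is a sketch of the validation generator with the same two fixes applied (pad the slice rather than the whole set, and loop forever), plus the training call with step counts derived from the dataset size instead of hard-coded. maxlen is assumed to be set as in the fixes above; its value is defined in the linked code:

def generate_val(valsize):

    (_,_), (x_test, y_test) = imdb.load_data(num_words=max_features)
    while True:  # restart from 0 once the test set is exhausted
        for i in range(0, len(x_test), valsize):
            x_val = x_test[i:(i+valsize)]
            y_val = y_test[i:(i+valsize)]
            x_val = sequence.pad_sequences(x_val, maxlen=maxlen, padding='post')
            yield(x_val, y_val)

# 25000 samples / 25 per batch = 1000 steps, matching the original values
primary_model.fit_generator(generate_batch(25),
                            steps_per_epoch=25000 // 25,
                            epochs=1,
                            callbacks=[early_stopping_monitor],
                            validation_data=generate_val(25),
                            validation_steps=25000 // 25)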

Full working code (updated with the fix from point 6) is linked here.