Keras autoencoder with pretrained embeddings returning incorrect number of dimensions
I've been trying to loosely replicate a sentence autoencoder based on an example from the Deep Learning with Keras book. I reworked the example to use an embedding layer instead of a sentence generator, and to use fit instead of fit_generator.
My code is as follows:
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, LSTM, Bidirectional, RepeatVector
from keras.models import Model

# df is a pandas DataFrame whose 'string' column holds the training sentences
df_train_text = df['string']
max_length = 80
embedding_dim = 300
latent_dim = 512
batch_size = 64
num_epochs = 10
# prepare tokenizer
t = Tokenizer(filters='')
t.fit_on_texts(df_train_text)
word_index = t.word_index
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_train_text = t.texts_to_matrix(df_train_text)
padded_train_text = pad_sequences(encoded_train_text, maxlen=max_length, padding='post')
padding_train_text = np.asarray(padded_train_text, dtype='int32')
embeddings_index = {}
f = open('/Users/embedding_file.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))
#Found 51328 word vectors.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
embedding_layer = Embedding(vocab_size,
                            embedding_dim,
                            weights=[embedding_matrix],
                            input_length=max_length,
                            trainable=False)
inputs = Input(shape=(max_length,), name="input")
embedding_layer = embedding_layer(inputs)
encoder = Bidirectional(LSTM(latent_dim), name="encoder_lstm", merge_mode="sum")(embedding_layer)
decoder = RepeatVector(max_length)(encoder)
decoder = Bidirectional(LSTM(embedding_dim, name='decoder_lstm', return_sequences=True), merge_mode="sum")(decoder)
autoencoder = Model(inputs, decoder)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(padded_train_text, padded_train_text,
                epochs=num_epochs,
                batch_size=batch_size,
                callbacks=[checkpoint])  # `checkpoint` is a callback defined elsewhere (not shown)
I confirmed that my layer shapes match those in the example, but when I try to fit my autoencoder I get the following error:
ValueError: Error when checking target: expected bidirectional_1 to have 3 dimensions, but got array with shape (36320, 80)
Some other things I have tried include switching texts_to_matrix to texts_to_sequences, and wrapping/not wrapping my padded strings. I also came across an answer that seems to indicate my approach is wrong. Is it possible to fit an autoencoder with an embedding layer the way I have coded it? If not, can someone help explain the fundamental difference between the provided example and my version?
Edit: I removed the return_sequences=True argument from the last layer and got the following error:
ValueError: Error when checking target: expected bidirectional_1 to have shape (300,) but got array with shape (80,)
The updated layer shapes:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input (InputLayer) (None, 80) 0
_________________________________________________________________
embedding_8 (Embedding) (None, 80, 300) 2440200
_________________________________________________________________
encoder_lstm (Bidirectional) (None, 512) 3330048
_________________________________________________________________
repeat_vector_8 (RepeatVecto (None, 80, 512) 0
_________________________________________________________________
bidirectional_8 (Bidirection (None, 300) 1951200
=================================================================
Total params: 7,721,448
Trainable params: 5,281,248
Non-trainable params: 2,440,200
_________________________________________________________________
Am I missing a step between the RepeatVector layer and the last layer of my model, so that it can return a shape of (None, 80, 300) rather than the (None, 300) it currently produces?
The Embedding layer takes a sequence of integers (i.e. word indices) of shape (num_words,) as input and gives the corresponding embeddings as output, of shape (num_words, embd_dim). So after fitting the Tokenizer instance on the given texts, you need to use its texts_to_sequences() method to convert each text to a sequence of integers:
encoded_train_text = t.texts_to_sequences(df_train_text)
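To make the difference concrete, here is a minimal sketch (using a toy two-sentence corpus of my own, not your df['string'] data) contrasting the two methods:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = ["the cat sat", "the dog sat on the mat"]

t = Tokenizer(filters='')
t.fit_on_texts(texts)

# texts_to_matrix produces one bag-of-words row per document:
# shape (num_samples, vocab_size) -- not a sequence of word indices.
matrix = t.texts_to_matrix(texts)
print(matrix.shape)   # (2, 7): 6 distinct words + the reserved index 0

# texts_to_sequences produces variable-length lists of word indices,
# which is what pad_sequences and the Embedding layer expect.
seqs = t.texts_to_sequences(texts)
print(seqs)           # e.g. [[1, 3, 2], [1, 4, 2, 5, 1, 6]]

padded = pad_sequences(seqs, maxlen=10, padding='post')
print(padded.shape)   # (2, 10): (num_samples, max_length)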
Further, since after padding, encoded_train_text will have a shape of (num_samples, max_length), the output shape of the network must be the same as well (i.e. because we are creating an autoencoder), so you need to remove the return_sequences=True argument of the last layer. Otherwise, it would give us a 3D tensor as output, which would not make sense.
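To illustrate that shape constraint, here is my own sketch, not code from the original example or answer: with return_sequences=True removed, the final Bidirectional(LSTM(...)) emits only its last output, so its unit count determines the model's output width; setting the decoder LSTM's units to max_length makes the output (None, max_length), matching the padded integer targets. (Whether reconstructing raw word indices with MSE is a sensible objective is a separate question; the sketch only demonstrates the shapes.)

from keras.layers import Input, Embedding, LSTM, Bidirectional, RepeatVector
from keras.models import Model

max_length = 80
embedding_dim = 300
latent_dim = 512
vocab_size = 8134   # implied by the question's model summary: 2,440,200 / 300

inputs = Input(shape=(max_length,), name="input")
# In practice you would also pass weights=[embedding_matrix] here.
embedded = Embedding(vocab_size, embedding_dim,
                     input_length=max_length, trainable=False)(inputs)
encoded = Bidirectional(LSTM(latent_dim), merge_mode="sum",
                        name="encoder_lstm")(embedded)
repeated = RepeatVector(max_length)(encoded)
# No return_sequences=True: the layer returns only its final output, of shape
# (None, max_length), matching the (num_samples, max_length) targets.
decoded = Bidirectional(LSTM(max_length), merge_mode="sum",
                        name="decoder_lstm")(repeated)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.summary()   # final output shape: (None, 80)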
As a side note, the following line is redundant, since padded_train_text is already a numpy array (and incidentally, you are not using padding_train_text at all):
padding_train_text = np.asarray(padded_train_text, dtype='int32')
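For what it's worth, a quick check confirms this: pad_sequences already returns an int32 NumPy array by default.

from keras.preprocessing.sequence import pad_sequences

padded = pad_sequences([[1, 2, 3]], maxlen=5, padding='post')
print(type(padded), padded.dtype)   # <class 'numpy.ndarray'> int32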