Why is the Keras embedding layer's weight matrix of size vocab_size + 1?
I have the toy example below, where my vocabulary size is 7 and the embedding size is 8, yet the Keras embedding layer's weight matrix comes out as 8x8. How is that? This seems related to other questions about the Keras embedding layer, namely that input_dim should be the "maximum integer index + 1". I have read the other Stack Overflow questions on this, but they all suggest it is not vocab_size + 1, while my code tells me it is.
I am asking because I need to know which embedding vector corresponds to which word.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work']
labels = np.array([1, 1, 1, 1])

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
encoded_docs = tokenizer.texts_to_sequences(docs)
max_seq_len = max(len(x) for x in encoded_docs)  # max len is 2
padded_seq = pad_sequences(sequences=encoded_docs, maxlen=max_seq_len, padding='post')
embedding_size = 8
tokenizer.index_word
{1: 'work',
2: 'well',
3: 'done',
4: 'good',
5: 'great',
6: 'effort',
7: 'nice'}
len(tokenizer.index_word) # 7
vocab_size = len(tokenizer.index_word)+1
model = Sequential()
model.add(Embedding(input_dim=vocab_size,output_dim=embedding_size,input_length=max_seq_len, name='embedding_lay'))
model.add(Flatten())
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['acc'])
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_lay (Embedding) (None, 2, 8) 64
_________________________________________________________________
flatten_1 (Flatten) (None, 16) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 17
=================================================================
Total params: 81
Trainable params: 81
Non-trainable params: 0
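A quick check of the parameter counts (a sketch with my own arithmetic, using the variables defined above) shows where the 8 rows come from:

# Embedding: one row per index 0..7, i.e. vocab_size * embedding_size weights.
print(vocab_size * embedding_size)        # 8 * 8 = 64
# Dense: Flatten yields max_seq_len * embedding_size = 16 inputs, plus 1 bias.
print(max_seq_len * embedding_size + 1)   # 17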
model.fit(padded_seq,labels, verbose=1,epochs=20)
model.get_layer('embedding_lay').get_weights()
[array([[-0.0389936 , -0.0294274 , 0.02361362, 0.01885288, -0.01246006,
-0.01004354, 0.01321061, -0.02298149],
[-0.01264734, -0.02058442, 0.0114141 , -0.02725944, -0.06267354,
0.05148344, -0.02335678, -0.06039589],
[ 0.0582506 , 0.00020944, -0.04691287, 0.02985037, 0.02437406,
-0.02782 , 0.00378997, 0.01849808],
[-0.01667434, -0.00078654, -0.04029636, -0.04981862, 0.01762467,
0.06667487, 0.00302309, 0.02881355],
[ 0.04509508, -0.01994639, 0.01837089, -0.00047283, 0.01141069,
-0.06225454, 0.01198813, 0.02102971],
[ 0.05014603, 0.04591557, -0.03119368, 0.04181939, 0.02837115,
-0.01640332, 0.0577693 , 0.01364574],
[ 0.01948108, -0.04200416, -0.06589368, -0.05397511, 0.02729052,
0.04164972, -0.03795817, -0.06763416],
[ 0.01284658, 0.05563928, -0.026766 , 0.03231764, -0.0441488 ,
-0.02879154, 0.02092744, 0.01947528]], dtype=float32)]
So how do I get my 7 word vectors (e.g. {1: 'work', ...}) out of this matrix of 8 vectors (rows), and what does the 8th vector mean?
If I change vocab_size = len(tokenizer.index_word), i.e. without the +1, I get dimension errors when trying to fit the model.
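A minimal sanity check on the token indices (a sketch assuming the variables defined above) illustrates the constraint that input_dim must cover the largest index:

# The Tokenizer assigns word indices starting at 1, so the largest index
# equals the vocabulary size (7), while index 0 never appears in the texts.
max_index = max(max(seq) for seq in encoded_docs)
print(max_index, vocab_size)   # 7 8
# Embedding(input_dim=N, ...) only accepts indices in [0, N), so with
# input_dim=7 the index 7 would be out of range and fitting fails.
assert max_index < vocab_size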
The Embedding layer uses tf.nn.embedding_lookup under the hood, which is zero-based by default. For example:
import tensorflow as tf
import numpy as np
docs = ['Well done!',
'Good work',
'Great effort',
'nice work']
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(docs)
encoded_docs = tokenizer.texts_to_sequences(docs)
max_seq_len = max(len(x) for x in encoded_docs) # max len is 2
padded_seq = tf.keras.preprocessing.sequence.pad_sequences(sequences=encoded_docs,maxlen=max_seq_len,padding='post')
embedding_size = 8
tf.random.set_seed(111)
# Create integer embeddings for demonstration purposes.
embeddings = tf.cast(tf.random.uniform((7, embedding_size), minval=10, maxval=20, dtype=tf.int32), dtype=tf.float32)
print(padded_seq)
tf.nn.embedding_lookup(embeddings, padded_seq)
[[2 3]
[4 1]
[5 6]
[7 1]]
<tf.Tensor: shape=(4, 2, 8), dtype=float32, numpy=
array([[[17., 11., 10., 16., 17., 16., 16., 17.],
[18., 15., 13., 13., 18., 18., 10., 16.]],
[[17., 16., 13., 12., 13., 15., 19., 14.],
[12., 15., 12., 15., 10., 19., 15., 12.]],
[[18., 15., 11., 13., 13., 13., 16., 10.],
[18., 18., 11., 12., 10., 13., 14., 10.]],
--> [[ 0., 0., 0., 0., 0., 0., 0., 0.] <--,
[12., 15., 12., 15., 10., 19., 15., 12.]]], dtype=float32)>
Note how the integer 7 is mapped to zeros, because tf.nn.embedding_lookup only knows how to map values from 0 to 6. That is why you should use vocab_size = len(tokenizer.index_word)+1: you also want a meaningful vector for the integer 7:
embeddings = tf.cast(tf.random.uniform((8, embedding_size), minval=10, maxval=20, dtype=tf.int32), dtype=tf.float32)
tf.nn.embedding_lookup(embeddings, padded_seq)
Index 0 can then be reserved for unknown (or padding) tokens, since your vocabulary starts at 1.
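To get back to the original question of which embedding row belongs to which word: a short sketch (assuming the trained model and tokenizer from the question) that builds the word-to-vector mapping; row 0 is the extra padding/unknown row:

# Row i of the embedding matrix is the vector for tokenizer.index_word[i];
# row 0 has no word attached and is only used for padding / unseen tokens.
weights = model.get_layer('embedding_lay').get_weights()[0]   # shape (8, 8)
word_vectors = {word: weights[idx] for idx, word in tokenizer.index_word.items()}
print(word_vectors['work'])   # the 8-dimensional vector for 'work' (row 1)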