Why is the Keras Embedding layer's weight matrix of size vocab_size + 1?

I have the toy example below, where my vocabulary size is 7 and the embedding size is 8, yet the Keras Embedding layer's weights come out as 8x8. How can that be? This seems related to other questions about the Keras Embedding layer, namely the "largest integer index + 1" rule. I have read all the other Stack Overflow questions about this, but they all suggest it is not vocab_size + 1, while my code tells me it is. I am asking because I need to know which embedding vector corresponds to which word.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work']
labels = np.array([1, 1, 1, 1])
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
encoded_docs = tokenizer.texts_to_sequences(docs)
max_seq_len = max(len(x) for x in encoded_docs)  # max len is 2
padded_seq = pad_sequences(sequences=encoded_docs, maxlen=max_seq_len, padding='post')
embedding_size = 8
tokenizer.index_word

{1: 'work', 2: 'well', 3: 'done', 4: 'good', 5: 'great', 6: 'effort', 7: 'nice'}

len(tokenizer.index_word)  # 7
vocab_size = len(tokenizer.index_word) + 1
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=max_seq_len, name='embedding_lay'))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_lay (Embedding)    (None, 2, 8)              64        
_________________________________________________________________
flatten_1 (Flatten)          (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 81
Trainable params: 81
Non-trainable params: 0

model.fit(padded_seq,labels, verbose=1,epochs=20)
model.get_layer('embedding_lay').get_weights()

[array([[-0.0389936 , -0.0294274 ,  0.02361362,  0.01885288, -0.01246006,
         -0.01004354,  0.01321061, -0.02298149],
        [-0.01264734, -0.02058442,  0.0114141 , -0.02725944, -0.06267354,
          0.05148344, -0.02335678, -0.06039589],
        [ 0.0582506 ,  0.00020944, -0.04691287,  0.02985037,  0.02437406,
         -0.02782   ,  0.00378997,  0.01849808],
        [-0.01667434, -0.00078654, -0.04029636, -0.04981862,  0.01762467,
          0.06667487,  0.00302309,  0.02881355],
        [ 0.04509508, -0.01994639,  0.01837089, -0.00047283,  0.01141069,
         -0.06225454,  0.01198813,  0.02102971],
        [ 0.05014603,  0.04591557, -0.03119368,  0.04181939,  0.02837115,
         -0.01640332,  0.0577693 ,  0.01364574],
        [ 0.01948108, -0.04200416, -0.06589368, -0.05397511,  0.02729052,
          0.04164972, -0.03795817, -0.06763416],
        [ 0.01284658,  0.05563928, -0.026766  ,  0.03231764, -0.0441488 ,
         -0.02879154,  0.02092744,  0.01947528]], dtype=float32)]

So how do I get my 7 word vectors (e.g. {1: 'work', ...}) out of this matrix of 8 vectors (rows), and what does the 8th vector mean? If I set vocab_size = len(tokenizer.index_word) without adding the +1, I get dimension errors and similar problems when trying to fit the model.

The Embedding layer uses tf.nn.embedding_lookup under the hood, which is zero-based by default. For example:

import tensorflow as tf
import numpy as np

docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work']
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(docs)
encoded_docs = tokenizer.texts_to_sequences(docs)
max_seq_len = max(len(x) for x in encoded_docs) # max len is 2
padded_seq = tf.keras.preprocessing.sequence.pad_sequences(sequences=encoded_docs,maxlen=max_seq_len,padding='post')
embedding_size = 8

tf.random.set_seed(111)

# Create integer embeddings for demonstration purposes.
embeddings = tf.cast(tf.random.uniform((7, embedding_size), minval=10,  maxval=20, dtype=tf.int32), dtype=tf.float32)

print(padded_seq)

[[2 3]
 [4 1]
 [5 6]
 [7 1]]

tf.nn.embedding_lookup(embeddings, padded_seq)
<tf.Tensor: shape=(4, 2, 8), dtype=float32, numpy=
array([[[17., 11., 10., 16., 17., 16., 16., 17.],
        [18., 15., 13., 13., 18., 18., 10., 16.]],

       [[17., 16., 13., 12., 13., 15., 19., 14.],
        [12., 15., 12., 15., 10., 19., 15., 12.]],

       [[18., 15., 11., 13., 13., 13., 16., 10.],
        [18., 18., 11., 12., 10., 13., 14., 10.]],

    --> [[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.] <--,
        [12., 15., 12., 15., 10., 19., 15., 12.]]], dtype=float32)>

Notice how the integer 7 is mapped to a zero vector, because tf.nn.embedding_lookup only knows how to map the values 0 through 6. That is why you should use vocab_size = len(tokenizer.index_word) + 1: you want a meaningful vector for the integer 7 as well:

embeddings = tf.cast(tf.random.uniform((8, embedding_size), minval=10,  maxval=20, dtype=tf.int32), dtype=tf.float32)

tf.nn.embedding_lookup(embeddings, padded_seq)
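
As a quick check (a minimal sketch building on the 8-row embeddings defined just above; the variable name looked_up is mine), the position that holds the integer 7 now resolves to row 7 of the table instead of a zero vector:

# Sketch: padded_seq[3, 0] is the integer 7 ('nice'); its lookup result
# should equal embeddings[7] rather than zeros.
looked_up = tf.nn.embedding_lookup(embeddings, padded_seq)
print(tf.reduce_all(tf.equal(looked_up[3, 0], embeddings[7])).numpy())  # True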

Index 0 can then be reserved for unknown (or padding) tokens, since your vocabulary starts at 1.
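
To get back to the original question of which row belongs to which word: with the extra row, row i of the trained embedding matrix is the vector for tokenizer.index_word[i], and row 0 belongs to no word. A minimal sketch, assuming the model and tokenizer built in the question:

# Sketch: map each word to its row in the trained embedding matrix.
# Row 0 is the reserved padding index and corresponds to no word.
weights = model.get_layer('embedding_lay').get_weights()[0]  # shape (8, 8) = (vocab_size, embedding_size)
word_vectors = {word: weights[idx] for idx, word in tokenizer.index_word.items()}
print(word_vectors['work'])  # the 8-dimensional vector stored in row 1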