Getting embedding matrix of all zeros after performing word embedding on any input data
I am trying to do word embedding in Keras, using 'glove.6B.50d.txt'. Everything works correctly up to the point of preparing the embedding index from the "glove.6B.50d.txt" file.
But whenever I map the words in my input to the words in the embedding index, I get an embedding matrix full of zeros.
Here is the code:
# imports needed by the snippet below
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# here is the example sentence given as input
line = "The quick brown fox jumped over the lazy dog"
line = line.split(" ")
#this is my embedding file
EMBEDDING_FILE='glove.6B.50d.txt'
embed_size = 10 # how big is each word vector
max_features = 10000 # how many unique words to use (i.e num rows in embedding vector)
maxlen = 10 # max number of words in a comment to use
tokenizer = Tokenizer(num_words=max_features,split=" ",char_level=False)
tokenizer.fit_on_texts(list(line))
list_tokenized_train = tokenizer.texts_to_sequences(line)
sequences = tokenizer.texts_to_sequences(line)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
print(sequences)
print(word_index)
print('Shape of data tensor:', X_t.shape)
#got correct output here as
# Found 8 unique tokens.
#[[1], [2], [3], [4], [5], [6], [1], [7], [8]]
#{'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumped': 5, 'over': 6, 'lazy': 7, 'dog': 8}
# Shape of data tensor: (9, 10)
# loading the embedding file to prepare the embedding index
embeddings_index = {}
for i in open(EMBEDDING_FILE, "rb"):
    values = i.split()
    word = values[0]
    # print(word)
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))
#Found 400000 word vectors.
# making the embedding matrix
embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index will be all zeros
        embedding_matrix[i] = embedding_vector
Here, when I print the embedding matrix, I get all zeros (i.e., not a single word from the input is recognized):
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
Also, if I print embeddings_index.get(word) on each iteration, it fails to fetch the word and returns None.
Where am I going wrong in my code?
- embed_size should be 50, not 10 (it is the dimensionality of the word embeddings).
- The number of features should be much larger than 50 (keep it close to 10,000). Limiting it to 50 would mean losing a large number of vectors.
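A minimal sketch of the first point, assuming the glove.6B.50d.txt vectors: with embed_size set to 50, each 50-dimensional GloVe row fits a row of the matrix (the toy embeddings_index and word_index below are stand-ins for the parsed file):

```python
import numpy as np

embed_size = 50  # glove.6B.50d.txt stores 50-dimensional vectors

# hypothetical stand-ins for the parsed GloVe file and the tokenizer output
embeddings_index = {"fox": np.arange(embed_size, dtype="float32")}
word_index = {"fox": 1}

embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
for word, i in word_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec  # shapes now match: a (50,) vector into a 50-wide row

print(embedding_matrix.shape)  # (2, 50)
```

With embed_size = 10, the same assignment would raise a shape-mismatch error instead of silently leaving zeros, so the dimension fix alone does not explain the all-zero matrix, but it is still required for the code to be correct.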
The problem is solved now.
It seems embeddings_index.get(word) was failing to fetch the words because of an encoding issue.
When preparing the embedding matrix, I changed for i in open(EMBEDDING_FILE, "rb"):
to for i in open(EMBEDDING_FILE, 'r', encoding='utf-8'):
and that fixed it.
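The mismatch can be reproduced without the GloVe file: in binary mode every key stored in embeddings_index is a bytes object, while word_index holds str keys, so .get(word) always misses. A minimal illustration (the tiny dictionaries are hypothetical stand-ins for the parsed file):

```python
# what "rb" mode produces: bytes keys
embeddings_bytes = {b"fox": [0.1, 0.2]}
# what text mode with encoding='utf-8' produces: str keys
embeddings_text = {"fox": [0.1, 0.2]}

print(embeddings_bytes.get("fox"))  # None: a str key never equals a bytes key
print(embeddings_text.get("fox"))   # [0.1, 0.2]
```

This is why every lookup returned None and the matrix stayed all zeros: the words were in the index, just under bytes keys.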