Tensorflow embeddings InvalidArgumentError: indices[18,16] = 11905 is not in [0, 11905) [[node sequential_1/embedding_1/embedding_lookup
I am using TF 2.2.0 and trying to create a Word2Vec CNN text classification model, but no matter what I try, there is always a problem with either the model or the embedding layer. I could not find a clear solution on the internet, so I decided to ask here.
import multiprocessing

import gensim
import numpy as np
import tensorflow as tf

# Train a CBOW Word2Vec model (gensim 3.x API) on the pre-filtered tweets.
modelW2V = gensim.models.Word2Vec(filtered_stopwords_list, size=100, min_count=5,
                                  window=5, sg=0, iter=10,
                                  workers=multiprocessing.cpu_count() - 1)
model_save_location = "3000tweets_notbinary"
modelW2V.wv.save_word2vec_format(model_save_location)

# Reload the saved vectors into a dict: word -> 100-d float32 vector.
word2vec = {}
with open('3000tweets_notbinary', encoding='UTF-8') as f:
    next(f)  # skip the "vocab_size vector_dim" header written by save_word2vec_format
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
# Build the embedding matrix indexed by the fitted Keras tokenizer's vocabulary;
# words the Word2Vec model never saw get a zero vector.
num_words = len(tokenizer.word_index)
embedding_matrix = np.random.uniform(-1, 1, (num_words, 100))
for word, i in tokenizer.word_index.items():
    if i < num_words:
        embedding_vector = word2vec.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
        else:
            embedding_matrix[i] = np.zeros((100,))
I created my word2vec weights with the code above and then converted them into an embedding_matrix, following the pattern from many tutorials. Since the tokenizer contains many words that word2vec never saw and that therefore have no embedding, I assign those a zero vector. Then I feed the data and the embeddings into the tf sequential model below.
seq_leng = max_tokens
vocab_size = num_words
embedding_dim = 100
filter_sizes = [3, 4, 5]
num_filters = 512
drop = 0.5
epochs = 5
batch_size = 32
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embedding_dim,
                              weights=[embedding_matrix],
                              input_length=max_tokens,
                              trainable=False),
    tf.keras.layers.Conv1D(num_filters, 7, activation="relu", padding="same"),
    tf.keras.layers.MaxPool1D(2),
    tf.keras.layers.Conv1D(num_filters, 7, activation="relu", padding="same"),
    tf.keras.layers.MaxPool1D(),
    tf.keras.layers.Dropout(drop),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(3, activation="softmax")
])

model.compile(loss="categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, epsilon=1e-06),
              metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
model.summary()

history = model.fit(x_train_pad, y_train2, batch_size=60, epochs=epochs,
                    shuffle=True, verbose=1)
But when I run this code, TensorFlow raises the following error at a random point during training, and I cannot find any solution to it. I tried adding + 1 to vocab_size, but then I got a size-mismatch error that would not even let me compile the model. Can anyone help me?
InvalidArgumentError: indices[18,16] = 11905 is not in [0, 11905)
[[node sequential_1/embedding_1/embedding_lookup (defined at <ipython-input-26-ef1b16cf85bf>:1) ]] [Op:__inference_train_function_1533]
Errors may have originated from an input operation.
Input Source operations connected to node sequential_1/embedding_1/embedding_lookup:
sequential_1/embedding_1/embedding_lookup/991 (defined at /usr/lib/python3.6/contextlib.py:81)
Function call stack:
train_function
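The message itself points at the root cause: the lookup asks for row 11905 of a table whose valid rows are 0..11904. A quick diagnostic sketch (reusing the tokenizer and num_words from the question; not part of the original code) makes the off-by-one visible:

# Keras Tokenizer assigns word indices starting at 1, so the largest
# index equals the vocabulary size itself, one past the last row of an
# Embedding built with input_dim=num_words.
print(max(tokenizer.word_index.values()))  # 11905
print(num_words)                           # 11905, but valid rows are 0..11904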
I solved this problem. As others suggested, I added one extra dimension to vocab_size, making it vocab_size + 1. However, I then ran into the mismatch between the layer size and the size of the embedding matrix. Appending a zero vector to the end of the embedding matrix solved that.
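A minimal sketch of that fix, reusing num_words and embedding_matrix from the question (the exact answer code was not posted, so names and layout here are assumptions):

vocab_size = num_words + 1  # one extra row so index num_words is in range

# Append one zero row so the initial weights match the enlarged vocabulary:
# the shape goes from (num_words, 100) to (num_words + 1, 100).
embedding_matrix = np.vstack([embedding_matrix, np.zeros((1, 100))])

# The Embedding layer and its initial weights now agree in shape:
# tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
#                           weights=[embedding_matrix], input_length=max_tokens,
#                           trainable=False)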