Extract the embedding vector for each word from an h2o.word2vec object

Question

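I am trying to build a pre-trained embedding layer with h2o.word2vec, and I want to extract every word in the model together with its corresponding embedding vector.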

Code:

library(data.table)
library(h2o)
h2o.init(nthreads = -1)

comment <- data.table(comments='ExplanationWhy the edits made under my username Hardcore Metallica 
                      Fan were reverted They werent vandalisms just closure on some GAs after I voted 
                      at New York Dolls FAC And please dont remove the template from the talk page since Im retired now')

comments.hex <- as.h2o(comment, destination_frame = "comments.hex", col.types=c("String"))

words <- h2o.tokenize(comments.hex$comments, "\\W+")

vectors <- 3 # Only 3 dimensions per word vector to save time & memory
w2v.model <- h2o.word2vec(words
                          , model_id = "w2v_model"
                          , vec_size = vectors
                          , min_word_freq = 1
                          , window_size = 2
                          , init_learning_rate = 0.025
                          , sent_sample_rate = 0
                          , epochs = 1) # only one epoch to save time
print(h2o.findSynonyms(w2v.model, "the",2))

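The h2o API lets me get the cosine similarity between two words, but I just want the vector for each word in my vocabulary. How can I get it? I could not find any simple way to do this in the API.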

Thanks in advance.

Answer 1

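You can use the method w2v_model.transform(words=words)

(the full signature is: w2v_model.transform(words =, aggregate_method =))

where words is an H2O frame consisting of a single column that contains the source words (note that you can pass a subset of this frame), and aggregate_method specifies how sequences of words are aggregated.

If you do not specify an aggregation method, no aggregation is performed and each input word is mapped to a single word vector. If the method is AVERAGE, the input is treated as sequences of words delimited by NA, and each sequence is reduced to the average of its word vectors.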
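To make the AVERAGE behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not H2O's implementation: the function name `average_sequences`, the use of `None` to stand in for NA, and the choice to skip unknown words and empty sequences are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence

Vector = List[float]


def average_sequences(
    tokens: Sequence[Optional[str]],
    embeddings: Dict[str, Vector],
) -> List[Vector]:
    """Average word vectors over runs of tokens separated by None.

    None plays the role of NA in the H2O frame (an assumption of this sketch).
    Words missing from `embeddings` are skipped, and sequences that contain
    no known word produce no output row (both are simplifications).
    """
    results: List[Vector] = []
    current: List[Vector] = []
    # A trailing None flushes the last sequence.
    for tok in list(tokens) + [None]:
        if tok is None:
            if current:
                dim = len(current[0])
                results.append(
                    [sum(vec[i] for vec in current) / len(current) for i in range(dim)]
                )
            current = []
        elif tok in embeddings:
            current.append(embeddings[tok])
    return results


# Hypothetical toy embeddings with 2 dimensions per word.
toy_embeddings = {"a": [1.0, 3.0], "b": [3.0, 5.0], "c": [10.0, 20.0]}
toy_tokens = ["a", "b", None, "c"]
averaged = average_sequences(toy_tokens, toy_embeddings)
print(averaged)
```

With no aggregation, the same input would instead yield one vector per token, which is what the question asks for.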

For example:

av_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")
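For the per-word case in the question, a small helper along the following lines may be useful. It is only a sketch: `word_vectors` is a hypothetical name, and it assumes that `w2v_model` is a trained H2O word2vec estimator, that `words` is the single-column frame of tokens, and that the frame supports `cbind` for placing the word column next to the vector columns.

```python
def word_vectors(w2v_model, words):
    """Return a frame with each input word next to its embedding vector.

    Assumes `w2v_model` exposes transform(words, aggregate_method=...) as
    described in the answer, and that `words` is a single-column frame with
    a cbind() method. Both objects are supplied by the caller.
    """
    # "NONE" requests one vector per input word, with no aggregation.
    vecs = w2v_model.transform(words, aggregate_method="NONE")
    # Put the word column and the vector columns side by side.
    return words.cbind(vecs)
```

A call would look like `word_vectors(w2v_model, words)`. Because the token frame contains repeated words, the result has one row per token rather than one row per vocabulary entry, so duplicates would still need to be dropped afterwards. In R, the corresponding call is believed to be `h2o.transform(w2v.model, words, aggregate_method = "NONE")`; treat that as an assumption to check against the h2o R documentation.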
