Use a pre-trained model with text2vec?

Question

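I would like to use a pre-trained model with text2vec. My understanding is that the benefit is that these models have already been trained on a huge volume of data, e.g. the Google News model.

Reading the text2vec documentation, it looks like the getting-started code reads in text data and then trains a model on it: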

library(text2vec)
text8_file = "~/text8"
if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", "~/text8.zip")
  unzip("~/text8.zip", files = "text8", exdir = "~/")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)

The documentation then goes on to show how to create tokens and a vocabulary:

# Create iterator over tokens
tokens <- space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

Then this looks like the step that fits the model:

glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)

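My question is: can the well-known Google pre-trained word2vec model be used here, without relying on my own vocabulary or my own local machine to train a model? If so, how can I read it in and use it in R?

I think I am misunderstanding or missing something. Can I use text2vec for this task?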

Answer 1

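At the moment text2vec does not provide any functionality for downloading or manipulating pre-trained word embeddings. I have a draft for adding such utilities in the next release.

On the other hand, you can easily do it manually with standard R tools. For example, here is how to read fastText vectors: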

con = url("https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.af.300.vec.gz", "r")
con = gzcon(con)
wv = readLines(con, n = 10)

Then you just need to parse it; strsplit and rbind are your friends.
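As a minimal sketch of that parsing step, the snippet below uses a small hand-made `wv` in place of the downloaded lines. The sample words and numbers are made up for illustration. It assumes the usual fastText `.vec` text layout: a header line with the word count and dimension, then one word followed by its vector components per line.

```r
# Hypothetical stand-in for the lines returned by readLines(con, n = ...).
# Assumed layout: header "<n_words> <dim>", then "word v1 v2 ..." per line.
wv = c(
  "3 4",
  "die 0.1 -0.2 0.3 0.4",
  "van 0.0 0.5 -0.1 0.2",
  "en -0.3 0.1 0.2 0.0"
)

# Drop the header line and split every remaining line on single spaces
parts = strsplit(wv[-1], " ", fixed = TRUE)

# The first field is the word, the remaining fields are the vector components
words = vapply(parts, function(p) p[1], character(1))
vectors = do.call(rbind, lapply(parts, function(p) as.numeric(p[-1])))
rownames(vectors) = words

dim(vectors)       # number of words x embedding dimension
vectors["van", ]   # look up one word's vector by name
```

The result is a plain numeric matrix with words as row names, which is the same shape as the word-vector matrices you would otherwise obtain by training.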

Answer 2

This comes a bit late, but it might be of interest to other users. Taylor Van Anne provides a small tutorial on how to use pre-trained GloVe vectors with text2vec here: https://gist.github.com/tjvananne/8b0e7df7dcad414e8e6d5bf3947439a9
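Independently of that tutorial (this is not its code), here is a rough base-R sketch of what you can do once pre-trained vectors are loaded as a matrix. It assumes the GloVe text layout, which has no header line and one "word v1 v2 ..." entry per line. The three sample lines and the name `glove_lines` are invented for illustration.

```r
# Hypothetical lines in GloVe text format: no header, "word v1 v2 ..." per line
glove_lines = c(
  "king 0.5 0.7 0.1",
  "queen 0.5 0.6 0.2",
  "apple -0.4 0.1 0.9"
)

# Parse into a numeric matrix with the words as row names
parts = strsplit(glove_lines, " ", fixed = TRUE)
emb = do.call(rbind, lapply(parts, function(p) as.numeric(p[-1])))
rownames(emb) = vapply(parts, function(p) p[1], character(1))

# Scale each row to unit length so that dot products equal cosine similarities
emb_unit = emb / sqrt(rowSums(emb^2))

# Cosine similarity of every word to "king", most similar first
sims = drop(emb_unit %*% emb_unit["king", ])
sort(sims, decreasing = TRUE)
```

No training and no vocabulary of your own is needed for this kind of lookup; the only cost is reading the file into memory.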

nlp

r

word2vec

text2vec