How to represent each word occurrence as a separate tcm vector in R?

Question

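I am looking for an efficient way to create a term co-occurrence matrix (tcm) for (each) target word in a corpus, such that every occurrence of the word constitutes its own vector (row) in the tcm, where the columns are the context words (i.e., a token-based co-occurrence model). This contrasts with the more common approach in vector semantics, where each term (type) gets one row and one column in a symmetric tcm, and the values are aggregated over the (co-)occurrences of the tokens of each type.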

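Obviously, this could be done from scratch using base R functionality, or hacked together by filtering a tcm produced by one of the existing packages that do this. However, the corpus data I am dealing with is fairly large (millions of words), and there are already good corpus/NLP packages for R that do this kind of task efficiently and store the results in memory-friendly sparse matrices, such as text2vec (function create_tcm), quanteda (fcm), and tidytext (cast_dtm). So it does not seem to make sense to try to reinvent the wheel (in terms of iterators, hashing, and so on). But I also cannot find a straightforward way to create a token-based tcm with any of these; hence this question.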

Minimal example:

  library(text2vec)
  library(Matrix)
  library(magrittr)

  # default approach to tcm with text2vec:
  corpus = strsplit(c("here is a short document", "here is a different short document"), " ")
  it = itoken(corpus) 
  tcm = create_vocabulary(it)  %>% vocab_vectorizer() %>% create_tcm(it, . , skip_grams_window = 2, weights = rep(1,2))

  # results in this:
  print(as.matrix(forceSymmetric(tcm, "U")))

            different here short document is a
  different         0    0     1        1  1 1
  here              0    0     0        0  2 2
  short             1    0     0        2  1 2
  document          1    0     2        0  0 1
  is                1    2     1        0  0 2
  a                 1    2     2        1  2 0

Attempt at a token-based model for the target word "short":

  i=0
  corpus = lapply(corpus, function(x) 
   ifelse(x == "short", {i<<-i+1;paste0("short", i)}, x  ) 
   ) # appends index to each occurrence so itoken distinguishes them
  it = itoken(corpus) 
  tcm = create_vocabulary(it)  %>% vocab_vectorizer() %>% create_tcm(it, . , skip_grams_window = 2, weights = rep(1,2))
  attempt = as.matrix(forceSymmetric(tcm, "U") %>% 
   .[grep("^short", rownames(.)), -grep("^short", colnames(.))] 
   ) # filters the resulting full tcm

  # yields intended result but is hacky/slow:
  print(attempt)

         different here document is a
  short2         1    0        1  0 1
  short1         0    0        1  1 1
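To make the relationship between the two models concrete, here is a small sanity check in base R. It is only an illustration: the numbers are copied by hand from the two printed matrices above, and the variable names `token_rows` and `type_row` are made up for this sketch. Summing the token-level rows column by column should reproduce the "short" row of the type-based tcm (minus its own "short" column).

```r
# Token-level rows for "short", copied from the filtered tcm printed above
token_rows <- rbind(
  short1 = c(different = 0, here = 0, document = 1, is = 1, a = 1),
  short2 = c(different = 1, here = 0, document = 1, is = 0, a = 1)
)

# Row for "short" in the type-based tcm above, with the "short" column dropped
type_row <- c(different = 1, here = 0, document = 2, is = 1, a = 2)

# Aggregating over occurrences gives back the type-level counts
stopifnot(all(colSums(token_rows) == type_row))
```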

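What would be a better/faster alternative for deriving a token-based tcm like the one in the last example? (Possibly using one of the R packages that already do type-based tcms.)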

Answer 1

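quanteda's fcm is a very efficient way to create feature co-occurrence matrices, either at the document level or within a user-defined context. It produces a sparse, symmetric feature-by-feature matrix. But it sounds like you want each occurrence of the target feature to be its own row, with the words surrounding it as the columns.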

From the example, it looks like you want a context window of +/- 2 words, so I have done this for the target word "short".

First, we get the contexts using keywords-in-context (kwic):

library("quanteda")
txt <- c("here is a short document", "here is a different short document")

(shortkwic <- kwic(txt, "short", window = 2))
#                                          
# [text1, 4]        is a | short | document
# [text2, 5] a different | short | document

Then create a corpus from the contexts, with the keyword as a unique document name:

shortcorp <- corpus(shortkwic, split_context = FALSE, extract_keyword = TRUE)
docnames(shortcorp) <- make.unique(docvars(shortcorp, "keyword"))
texts(shortcorp)
#                 short                      short.1 
# "is a short document" "a different short document"

Then create a dfm, selecting all the words but removing the target:

dfm(shortcorp) %>%
  dfm_select(dfm(txt)) %>%
  dfm_remove("short")
# Document-feature matrix of: 2 documents, 5 features (40% sparse).
# 2 x 5 sparse Matrix of class "dfm"
#          features
# docs      here is a document different
#   short      0  1 1        1         0
#   short.1    0  0 1        1         1
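
For comparison, the same token-by-context counts can also be assembled directly into a sparse matrix with the Matrix package. The following is a minimal sketch rather than a tuned implementation: `token_tcm` is a hypothetical helper written for this illustration (it is not part of quanteda or text2vec), it assumes the documents are already tokenized into character vectors, and it has not been benchmarked on a large corpus.

```r
library(Matrix)

# Hypothetical helper (not from any package): one row per occurrence of
# `target`, one column per other word type, counts within +/- `window` tokens.
token_tcm <- function(docs, target, window = 2L) {
  vocab <- setdiff(unique(unlist(docs)), target)
  hits <- list()  # column indices of the context words, one element per occurrence
  for (doc in docs) {
    for (pos in which(doc == target)) {
      lo <- max(1L, pos - window)
      hi <- min(length(doc), pos + window)
      idx <- setdiff(seq(lo, hi), pos)         # window positions, excluding the target
      cols <- match(doc[idx], vocab)
      hits[[length(hits) + 1L]] <- cols[!is.na(cols)]  # NA = another target token
    }
  }
  n_occ <- length(hits)
  # sparseMatrix() sums repeated (i, j) pairs, so repeated context words are counted
  sparseMatrix(
    i = rep(seq_len(n_occ), lengths(hits)),
    j = as.integer(unlist(hits)),
    x = rep(1, sum(lengths(hits))),
    dims = c(n_occ, length(vocab)),
    dimnames = list(sprintf("%s%d", target, seq_len(n_occ)), vocab)
  )
}

docs <- strsplit(c("here is a short document",
                   "here is a different short document"), " ")
m <- token_tcm(docs, "short")
```

On the two example sentences, `m` should carry the same counts as the dfm above, with rows named `short1` and `short2`. The design choice here is to collect only the (row, column) index pairs and let `sparseMatrix()` do the aggregation, so no dense intermediate matrix is built.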

