How to create a document term incidence matrix from long format text data?

Question

I have data like this:

ID	word
1	blue
1	red
1	green
1	yellow
2	blue
2	purple
2	orange
2	green

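But I would like to convert it into a binary incidence matrix that indicates whether a word occurs in a given document ID. In other words, I want to create a matrix that looks like this: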

ID	blue	red	green	yellow	purple	orange
1	1	1	1	1	0	0
2	1	0	1	0	1	1

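Is there a way to do this with the tm package? I thought DocumentTermMatrix() might work, since I don't think any word in my corpus occurs more than once within a single document, but everything I have tried returns an error message saying the function is not compatible with objects of class data.frame.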

Answer 1

A possible solution, based on tidyr::pivot_wider:

library(tidyverse)

df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue","red", "green","yellow","blue","purple","orange","green")
)

# One column per word; each cell holds the word's count for that ID (0 if absent)
df %>% 
  pivot_wider(id_cols = ID, names_from = word, values_from = word,
       values_fn = length, values_fill = 0)

#> # A tibble: 2 × 7
#>      ID  blue   red green yellow purple orange
#>   <int> <int> <int> <int>  <int>  <int>  <int>
#> 1     1     1     1     1      1      0      0
#> 2     2     1     0     1      0      1      1
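
For comparison, the same reshaping can be sketched in base R with table(). This is only an illustrative alternative, not part of the tidyr approach above; the names counts, incidence and incidence_df are made up for the example. Capping the counts at 1 keeps the result strictly 0/1 even if a word were repeated within a document.

```r
df <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue", "red", "green", "yellow", "blue", "purple", "orange", "green"),
  stringsAsFactors = FALSE
)

# Cross-tabulate ID against word (word columns come out in alphabetical order)
counts <- table(df$ID, df$word)

# Turn counts into 0/1 presence indicators
incidence <- (counts > 0) * 1L

# Optional: a plain data frame with ID as the first column
incidence_df <- cbind(
  ID = as.integer(rownames(incidence)),
  as.data.frame(unclass(incidence))
)
```

Note that table() builds a dense result, so this sketch suits small vocabularies; for large corpora a sparse representation is likely the better choice.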

Answer 2

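If you want to do this in order to run supervised or unsupervised machine learning models, you should convert the tidy data frame directly into a document-feature matrix (dfm). dfms are a class of sparse matrix that can be used efficiently for these tasks. To do so, you can use cast_dfm from tidytext. But you first have to count how often each word occurs.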

library(tidyverse)
library(tidytext)

df <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue","red", "green","yellow","blue","purple","orange","green")
)

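# count() yields one row per (ID, word) pair with its frequency n;
# cast_dfm() needs the quanteda package to be installed
df %>% 
  count(ID, word) %>% 
  cast_dfm(ID, word, n)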
#> Document-feature matrix of: 2 documents, 6 features (33.33% sparse) and 0 docvars.
#>     features
#> docs blue green red yellow orange purple
#>    1    1     1   1      1      0      0
#>    2    1     1   0      0      1      1

Created on 2022-02-12 by the reprex package (v2.0.1)

You can convert this object back to a data frame with quanteda::convert(x, to = "data.frame"), but if you are running a classification task, it makes more sense to use it directly.
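
A minimal sketch of that conversion, assuming the tidytext and quanteda packages are installed. The object names dfm_counts, dfm_binary and incidence_df are made up for the example, and the dfm_weight(scheme = "boolean") step is an addition not in the answer above; it only matters if a word can occur more than once per document.

```r
library(dplyr)
library(tidytext)
library(quanteda)

df <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue", "red", "green", "yellow", "blue", "purple", "orange", "green"),
  stringsAsFactors = FALSE
)

# Build the sparse document-feature matrix from per-document word counts
dfm_counts <- df %>%
  count(ID, word) %>%
  cast_dfm(ID, word, n)

# Cap every count at 1 so the matrix records presence/absence only
dfm_binary <- dfm_weight(dfm_counts, scheme = "boolean")

# Back to an ordinary data frame; document IDs land in the doc_id column
incidence_df <- convert(dfm_binary, to = "data.frame")
```

The data frame is dense, so for a large vocabulary it is usually better to keep working with the dfm itself.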
